Dedomeno: A Spanish real estate (Idealista) Python scraper

Dedomeno is a Python scraper built with Django and Scrapy for idealista.com and, hopefully, more sites to come.


Motivation

As I mentioned before in the rental room prices in Spanish municipalities post, the Spanish real estate market can be intense.

For citizens: it is hard to make informed decisions when buying a house, investing in real estate, or even renting.

For big companies: trust funds, real estate companies, banks… it is so much easier. They have, or can buy, the data and insights needed to make the right investment decisions.

It is not a fair competition.

Overview

Idealista.com is the de facto standard for searching for a house or any other real estate product. It is the biggest online housing marketplace in Spain; even real estate agencies advertise their properties on it.

They have plenty of data which, of course, they don’t share for free. They publish only a few, not very detailed, reports, and offer a very limited API that you first have to request access to, without always getting a response.

Because of that, I decided to open-source the Python scraper that I have used to understand the Spanish real estate market and to make informed house investments. I called it Dedomeno and you can find it at:

https://github.com/ginopalazzo/dedomeno

Technology

The technology behind Dedomeno is largely based on the most common Python frameworks and tools:

  • Django, for the data models and the admin interface
  • Scrapy, for the crawling itself
  • Celery with the RabbitMQ message broker, for running crawls as background tasks
  • django-celery-beat, for scheduling periodic crawls
  • Flower, for monitoring the Celery workers

Features

The first Dedomeno version was intended to be a homemade crawler for idealista.com, but over time, in order to automate all the processes, it became more and more complex (not an exhaustive list):

  • Manual crawls of a given property type, transaction and set of provinces (crawlproperty.py)
  • Programmatic crawls scheduled with Celery and django-celery-beat
  • Monitoring of the crawl tasks with Flower
  • Browsing the scraped data through the Django admin

How to use it

There are two ways of using Dedomeno:

Manual

Use crawlproperty.py to make manual crawls; just modify the CrawlPropertyReactor parameters to your needs:

# dedomeno/idealista/crawlproperty.py

if __name__ == "__main__":
    # Crawl all land listings for sale in the province of Salamanca
    spider = CrawlPropertyReactor(property_type='land',
                                  transaction='sale',
                                  provinces=['salamanca'])
    spider.conf()  # configure the crawl
    spider.run()   # start the crawl
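The parameters above are hard-coded, so every new crawl means editing the file. If you prefer passing them from the command line, a thin wrapper along these lines would do; it is only a sketch, the script is not part of the repository and it assumes that CrawlPropertyReactor can be imported from the dedomeno/idealista/crawlproperty.py path shown above:

# run_crawl.py (hypothetical helper, not included in the repository)
import argparse

# Import path assumed from the dedomeno/idealista/crawlproperty.py comment above
from idealista.crawlproperty import CrawlPropertyReactor

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Run a manual Idealista crawl')
    parser.add_argument('--property-type', default='land')
    parser.add_argument('--transaction', default='sale')
    parser.add_argument('--provinces', nargs='+', default=['salamanca'])
    args = parser.parse_args()

    spider = CrawlPropertyReactor(property_type=args.property_type,
                                  transaction=args.transaction,
                                  provinces=args.provinces)
    spider.conf()
    spider.run()

With that in place, python run_crawl.py --property-type land --transaction sale --provinces salamanca reproduces the example above.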

Programmatic

Use Django + Celery + Flower + Scrapy to do programmatic crawls and visualize the output:

  1. Start the Django development server at http://127.0.0.1:8000/
    • python manage.py runserver
  2. Remove all tasks from the Celery queue (only works when RabbitMQ is up)
    • celery -A dedomeno purge
  3. Open the admin at http://127.0.0.1:8000/admin and use django-celery-beat to schedule the periodic tasks in the database (see the sketch after this list)
  4. Run the RabbitMQ message broker
    • sudo rabbitmq-server
  5. Celery:
    1. Start the Celery worker
      • celery -A dedomeno worker --loglevel=INFO
    2. Run Flower, a web-based tool for monitoring and administering Celery clusters
      • celery -A dedomeno flower
    3. Start Celery beat (the task scheduler)
      • celery -A dedomeno beat -l info -S django
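Step 3 is normally done by hand in the Django admin, but the same schedule can be created in code with django-celery-beat’s models. The sketch below assumes a hypothetical Celery task, idealista.tasks.crawl_property, that simply wraps CrawlPropertyReactor; the django_celery_beat calls are the library’s documented API, and the scheduling part would be run once, for example from python manage.py shell:

# idealista/tasks.py (hypothetical module, not part of the repository)
import json

from celery import shared_task
from django_celery_beat.models import IntervalSchedule, PeriodicTask


@shared_task
def crawl_property(property_type, transaction, provinces):
    """Wrap one crawl so Celery can run it as a background task."""
    from idealista.crawlproperty import CrawlPropertyReactor  # assumed import path
    spider = CrawlPropertyReactor(property_type=property_type,
                                  transaction=transaction,
                                  provinces=provinces)
    spider.conf()
    spider.run()


# Run once (e.g. from a Django shell) to create the schedule that
# celery beat (step 5.3) will pick up from the database.
schedule, _ = IntervalSchedule.objects.get_or_create(every=7,
                                                     period=IntervalSchedule.DAYS)
PeriodicTask.objects.create(
    name='Weekly crawl: land for sale in Salamanca',
    task='idealista.tasks.crawl_property',
    interval=schedule,
    args=json.dumps(['land', 'sale', ['salamanca']]),
)

A CrontabSchedule works the same way if you prefer crawls on fixed days and hours instead of a plain interval.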

Next steps

There are plenty of next steps, and I could use some help ;):

  • Test it in a production environment with multiple concurrent crawls
  • Make the spider less sensitive to changes made by Idealista
  • Redesign it so that sources other than idealista.com can be added

Conclusion

Dedomeno works fine for crawling idealista.com in both a manual and a programmatic way.

But that’s it.

It hasn’t been tested in a production environment with multiple crawler threads, it is very sensitive to changes made by Idealista, and it is not well designed for adding more sources.

But it’s a start. Happy scraping!