Motivation
As I mentioned in the post on room rental prices in Spanish municipalities, the Spanish real estate market can be intense.
For citizens: it is hard to make informed decisions when buying a house, investing in real estate, or even renting.
For big companies (trust funds, real estate companies, banks…): it is easy. They have, or can buy, the data and insights to make the right investment decision.
It is not a fair competition.
Overview
Idealista.com is the de facto standard for searching for a house or other real estate product. It is the biggest online housing marketplace in Spain; even real estate agencies advertise their products on it.
They have plenty of data which, of course, they don’t share for free. They publish only a few, not very detailed reports, and offer a very limited API that you first have to request access to, and you don’t always get a response.
Because of that, I decided to open source the Python scraper that I have used to understand the Spanish real estate market and make informed house investments; I called it Dedomeno and you can find it at:
https://github.com/ginopalazzo/dedomeno
Technology
The technology behind Dedomeno is largely based on the most common Python frameworks:
- Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
- Scrapy - An open source and collaborative framework for extracting the data you need from websites.
- Celery - An asynchronous task queue/job queue based on distributed message passing.
Features
The first Dedomeno version was intended to be a homemade crawler for idealista.com, but over time, in order to automate all the processes, it became more and more complex (not an exhaustive list):
- Crawler (Scrapy)
- Proxy Rotator Middleware to avoid getting banned.
- Proxy banned alert Middleware (remove proxy and send an email).
- Statistical random User Agent Middleware (with file backup).
- Django PostgreSQL database to store all types of real estate properties.
- Celery tasks integrated with the Django db.
- Flower to monitor the Celery workers.
- Idealista.com features
- Scrape, clean and store 7 types of real estate properties (houses, offices, rooms…).
- Scrape all real estate companies’ information and relate it to the properties.
- Check for off-line properties and record the dates each one went on-line and off-line.
- Check for price changes.
- Geocode each property:
- long, lat.
- street, city/town, province.
- standard geocode for province, city, district, neighborhood.
- Make a Spanish geocode tree to make quick operations with sets.
- Web page (Django server)
- Real Estate companies information.
- Download the data.
- View property details.
- Spanish map view with average prices.
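To make the proxy items in the list above more concrete, here is a minimal sketch of how a rotating-proxy downloader middleware for Scrapy could look. It is not Dedomeno's actual code: the class name `RotatingProxyMiddleware`, the `PROXY_LIST` setting and the choice of 403/429 as "banned" signals are all assumptions made for illustration. The only Scrapy behaviour it relies on is that a request is sent through the proxy URL stored in `request.meta["proxy"]`.

```python
import logging
import random

logger = logging.getLogger(__name__)


class RotatingProxyMiddleware:
    """Hypothetical Scrapy downloader middleware: rotate proxies, drop banned ones."""

    # Assumption: these HTTP statuses mean the proxy has been banned.
    BAN_STATUSES = frozenset({403, 429})

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a made-up setting name holding a list of proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy routes the request through the URL stored in request.meta["proxy"].
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        proxy = request.meta.get("proxy")
        if response.status in self.BAN_STATUSES and proxy in self.proxies:
            self.proxies.remove(proxy)
            # An email alert could be sent here; this sketch only logs.
            logger.warning("Proxy %s looks banned, %d left", proxy, len(self.proxies))
        return response
```

A class like this would be enabled through Scrapy's `DOWNLOADER_MIDDLEWARES` setting, pointing at wherever the module lives in your project.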
How to use it
There are 2 ways of using Dedomeno:
Manual
Use crawlpropery.py to make manual crawls; just modify the CrawlPropertyReactor parameters to your needs.
Programmatic
Use Django + Celery + Flower + Scrapy to run programmatic crawls and visualize the output:
- Start the Django server, http://127.0.0.1:8000/
python manage.py runserver
- Remove all tasks from the Celery queue (only works when RabbitMQ is up)
celery -A dedomeno purge
- Enter the admin at http://127.0.0.1:8000/admin and change django-celery-beat to schedule the periodic tasks in the db
- Run the RabbitMQ message broker
sudo rabbitmq-server
- Celery:
- Start the celery worker
celery -A dedomeno worker --loglevel=INFO
- Run Flower, a web-based tool for monitoring and administering Celery clusters
celery -A dedomeno flower
- Start the celery beat (schedule tasks)
celery -A dedomeno beat -l info -S django
Next steps
Plenty of next steps, I could use some help ;) :
- Decouple source. Now Dedomeno only takes idealista.com information and is highly dependent on idealista.
- Add more sources (e.g. fotocasa, the second player).
- Alerts when idealista makes major and minor changes.
- Comprehensive tests.
- Better map visualization.
- Combine idealista and fotocasa information with official data.
- …
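As a rough idea of what decoupling the source could mean, the sketch below puts every portal behind one small interface so the rest of the pipeline never talks to idealista directly. Everything in it is hypothetical: `Listing`, `ListingSource`, `StaticSource` and `collect` are names invented for this example and do not exist in Dedomeno, and a real source would wrap a crawler instead of returning canned rows.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Protocol, Tuple


@dataclass(frozen=True)
class Listing:
    """Source-independent view of one property (hypothetical fields)."""
    source: str
    external_id: str
    price_eur: float


class ListingSource(Protocol):
    """Anything that has a name and can yield listings counts as a source."""
    name: str

    def fetch(self) -> Iterable[Listing]: ...


class StaticSource:
    """Stand-in source returning canned rows; a real one would wrap a crawler."""

    def __init__(self, name: str, rows: Iterable[Tuple[str, float]]):
        self.name = name
        self.rows = list(rows)

    def fetch(self) -> Iterator[Listing]:
        for external_id, price_eur in self.rows:
            yield Listing(self.name, external_id, price_eur)


def collect(sources: Iterable[ListingSource]) -> Dict[Tuple[str, str], Listing]:
    """Merge listings from all sources, keyed by (source, external_id)."""
    merged: Dict[Tuple[str, str], Listing] = {}
    for source in sources:
        for listing in source.fetch():
            merged[(listing.source, listing.external_id)] = listing
    return merged
```

Keying by `(source, external_id)` keeps ids from different portals from colliding; matching the same physical property across portals is a separate problem that this sketch does not attempt.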
Conclusion
Dedomeno works fine for crawling idealista.com, both manually and programmatically.
But that’s it.
It hasn’t been tested in a production environment with multiple crawler threads, and it is very sensitive to changes made by Idealista. It is also not designed to support more sources.
But it’s a start, happy scraping!