Motivation
As I mentioned before in the rental room prices in Spanish municipalities post, the Spanish real estate market can be intense.
For citizens: Is hard to make informed decisions when house buying, real estate investment or even house rental.
For big companies: Trust funds, real estate companies, banks… it is so easy. They have or could buy the data and insights to make the right investment decision.
It is not a fair competition.
Overview
Idealista.com is the de facto standard for searching for a house or other real estate product. It is the biggest on-line house marketplace in Spain, even the real estate agencies advertise their products on this web-page.
They have plenty of data which of course, they don’t share for free. They only have a few and not very detailed reports. And a very limited API that you first have to request for access, and don’t always get a response.
Because of that, I decided to open source the python scraper that I have used to understand the Spanish real estate market and make informed house investments; I called it Dedomeno and you can find it at:
https://github.com/ginopalazzo/dedomeno
Technology
The technology behind Dedomeno is largely based on the most common Python frameworks:
- Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
 - Scrapy - An open source and collaborative framework for extracting the data you need from websites.
 - Celery - An asynchronous task queue/job queue based on distributed message passing.
 
Features
The first Dedomeno version was intended to be a homemade crawler for idealista.com, but over time, in order to automatize all the processes, it become more and more complex (not an exhaustive list):
- Crawler (Scrapy)
    
- Proxy Rotator Middleware not to get banned.
 - Proxy banned alert Middleware (remove proxy and send an email).
 - Statistical random User Agent Middleware (with file backup).
 - Django Postgresql database to store all types of real estate properties.
 - Celery task integrated with Django db.
 - Flower to monitor the celery workers.
 
 - Idealista.com features
    
- Scrape, clean and store 7 types of real estate properties (houses, offices, rooms…).
 - Scrape all real estate companies information and related it to the properties.
 - Check for off-line properties. Date of on-line and off-line.
 - Check for price changes.
 - Geocode each property:
        
- long, lat.
 - street, city/town, province.
 - standard geocode for province, city, district, neighborhood.
 
 - Make a Spanish geocode tree to make quick operations with sets.
 
 - Web page (Django server)
    
- Real Estate companies information.
 - Download the data.
 - View property details.
 - Spanish map view with average prices.
 
 
How to use it
There are 2 ways of using Dedomeno:
Manual
Use crawlpropery.py to make manual crawls, just modify the CrawlPropertyReactor parameters to your needs:
# dedomeno/idealista/crawlproperty.py
if __name__ == "__main__":
    spider = CrawlPropertyReactor(property_type='land',
             transaction='sale', provinces=['salamanca'])
    spider.conf()
    spider.run()Programmatic
Use django + celery + flower + scrapy to do programmatic crawls and visualize the output:
- start django server, http://127.0.0.1:8000/
    
python manage.py runserver
 - remove all tasks from queue: celery (Only works when RabbitMQ is up)
    
celery -A proj purge
 - Enter admin http://127.0.0.1:8000/admin and change 
django-celery-beatto schedule the periodic task in the db - Run the RabbitMQ message broker
    
sudo rabbitmq-server
 - Celery:
    
- Start the celery worker
        
celery -A dedomeno worker --loglevel=INFO
 - Run Flower, a web based tool for monitoring and administrating Celery clusters
        
celery -A dedomeno flower
 - Start the celery beat (schedule tasks)
        
celery -A dedomeno beat -l info -S django
 
 - Start the celery worker
        
 
Next steps
Plenty of next steps, I could use some help ;) :
- Decouple source. Now Dedomeno only takes idealista.com information and is highly dependent on idealista.
 - Add more sources (i.e. fotocasa, the second player).
 - Alerts when idealista makes major and minor changes.
 - Comprehensive tests.
 - Better map visualization.
 - Mix idealista, fotocasa information with official data.
 - …
 
Conclusion
Dedomeno works fine to crawl idealista.com in both, a manual and programmatic way.
But that’s it.
It hasn’t been tested in a production environment with multiple crawler threads and it is very sensitive to changes made by Idealista. Also it is not well designed to have more sources.
But it’s a start, happy scraping!