First steps
We will use Django to easily access, store (and maybe present, too) the data that we extract from the web with Scrapy. We also need the scrapy-djangoitem extension, which connects the data models of the two frameworks. The scrapy project should live inside the django project, at the same level as an app.
If you are reading this post I assume you have a basic knowledge of how Django and Scrapy work. You can read further in the following links:
- Install django:
pip install django
- New django project:
django-admin startproject mysite
- New django app:
python manage.py startapp polls
- Install scrapy:
pip install Scrapy
- Install scrapy-djangoitem:
pip install scrapy-djangoitem
- New scrapy project & spider:
scrapy startproject scrapy-project
scrapy genspider spider example.com
Django configuration
Make sure you add both the django app and the scrapy project to the settings of your django project, django-project/django-project/settings.py:
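A minimal sketch of that settings change, assuming the django app created earlier is called `polls` and the scrapy project is importable as `scrapy_project` (Python package names cannot contain hyphens, so the hyphenated paths in this post must map to underscore module names):

```python
# django-project/django-project/settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'polls',           # our django app (assumed name)
    'scrapy_project',  # the scrapy project, registered as a django app
]
```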
Scrapy configuration
We need to add empty __init__.py files at django-project/scrapy-project/__init__.py and django-project/scrapy-project/spiders/__init__.py so that django recognizes the scrapy project as a package.
We will also modify our django-project/scrapy-project/settings.py as follows, so django and scrapy can communicate:
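One way to sketch that modification, assuming the inner django settings package is importable as `django_project` and the scrapy package as `scrapy_project` (underscore names, as Python cannot import hyphenated modules):

```python
# django-project/scrapy-project/settings.py
import os
import sys

import django

# Make the django project importable when scrapy runs from its own
# directory: add the parent of the scrapy package to sys.path.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# Point django at the project settings and boot it, so the ORM is
# available to items, spiders and pipelines.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_project.settings')
django.setup()

BOT_NAME = 'scrapy_project'
SPIDER_MODULES = ['scrapy_project.spiders']
NEWSPIDER_MODULE = 'scrapy_project.spiders'
```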
django-project/scrapy-project/apps.py needs to be created so that the scrapy project can pose as a django app:
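A minimal AppConfig for this, again assuming the package is importable as `scrapy_project`:

```python
# django-project/scrapy-project/apps.py
from django.apps import AppConfig


class ScrapyProjectConfig(AppConfig):
    # "name" must match the importable package name of the scrapy project.
    name = 'scrapy_project'
    verbose_name = 'Scrapy project'
```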
In django-project/scrapy-project/items.py we create the scrapy items, which inherit from DjangoItem, and we bind each one to its django model through the django_model attribute:
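For example, assuming a hypothetical django model named `Article` in the `polls` app:

```python
# django-project/scrapy-project/items.py
from scrapy_djangoitem import DjangoItem

from polls.models import Article  # assumption: an Article model in the polls app


class ArticleItem(DjangoItem):
    # The item's fields are generated automatically from the model's fields.
    django_model = Article
```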
Now the DjangoItems are ready to use in the spider. We can also access the Django model through the Django ORM (delete, create, queries…). django-project/scrapy-project/spiders/spider.py:
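A sketch of such a spider, filling the hypothetical `ArticleItem` from above; the URL and CSS selectors are placeholders you would adapt to the site you scrape:

```python
# django-project/scrapy-project/spiders/spider.py
import scrapy

from scrapy_project.items import ArticleItem  # the DjangoItem defined in items.py


class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://example.com/articles']  # placeholder URL

    def parse(self, response):
        # Placeholder selectors: adjust to the markup of the target site.
        for article in response.css('div.article'):
            item = ArticleItem()
            item['title'] = article.css('h2::text').get()
            item['url'] = article.css('a::attr(href)').get()
            yield item
```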
Usually you will use an Item Pipeline for cleaning, checking for duplicates and storing the item in the database (though you could also do this in the spider, depending on your needs). django-project/scrapy-project/pipelines.py:
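A minimal pipeline sketch under the same assumptions; DjangoItem's `save()` writes the item through the Django ORM:

```python
# django-project/scrapy-project/pipelines.py
from scrapy.exceptions import DropItem


class DjangoWriterPipeline:
    def process_item(self, item, spider):
        # Basic cleaning/validation: discard items without a title.
        if not item.get('title'):
            raise DropItem('missing title')
        # save() creates the django model instance and stores it
        # in the database through the django ORM.
        item.save()
        return item
```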
With the appropriate imports, as shown, you can interact both with your Scrapy DjangoItem objects and with the underlying Django objects.
The last thing to do is to enable our pipeline in the scrapy settings file, django-project/scrapy-project/settings.py:
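Assuming the pipeline class sketched above, that looks like:

```python
# django-project/scrapy-project/settings.py
ITEM_PIPELINES = {
    # The integer sets the pipeline's execution order (0-1000, lower runs first).
    'scrapy_project.pipelines.DjangoWriterPipeline': 300,
}
```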
HAPPY SCRAPING!
I’ll try to do more posts exploring the following further:
Usually you would use the scrapy command line to start the crawler. But if you want to run recurrent scrapes of the same domain you will probably have to do one of the following:
- Programmatically: use an asynchronous messaging system like Celery and a broker like RabbitMQ
- From views: trigger spiders from Django views with ‘python-scrapyd-api’, as explained in How to use Scrapy with Django Application