Django + Scrapy

How I have integrated Django with Scrapy (not necessarily the best solution).


First steps

We will use Django to easily access and store (and maybe present, too) the data that we want to extract from the web with Scrapy. We also need the scrapy-djangoitem extension (installed with pip install scrapy-djangoitem), which connects the data models of the two frameworks. The scrapy project should live inside the django project, at the same level as an app. Note that Python package names cannot contain hyphens, so the app and project directories use underscores throughout.

If you are reading this post I assume you have a basic knowledge of how Django and Scrapy work. You can read further in their official documentation.

# How the directory tree should look
...
├── requirements.txt
├── manage.py
├── django_project
│   ├── django_project
│   │   ├── settings.py
│   │   ├── urls.py
│   │   ├── views.py
│   │   └── wsgi.py
│   ├── example_app
│   │   ├── admin.py
│   │   ├── apps.py
│   │   ├── migrations
│   │   ├── models.py
│   │   ├── static
│   │   │   ├── ...
│   │   ├── templates
│   │   │   ├── ...
│   │   ├── tests.py
│   │   ├── urls.py
│   │   ├── views.py
│   ├── scrapy_project
│   │   ├── ...
│   │   ├── apps.py
│   │   ├── crawlproperty.py
│   │   ├── geckodriver.log
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   ├── spiders
│   │   │   ├── spider.py
...

Django configuration

Make sure you add both the django app and the scrapy project to the settings of your django project, django_project/django_project/settings.py:

INSTALLED_APPS = [
    'example_app.apps.ExampleAppConfig',
    'scrapy_project.apps.ScrapyProjectConfig',
    ...
]
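
For reference, the Xxxx and Yyyy models used in the rest of the post could look like the following minimal sketch of example_app/models.py (the model and field names are hypothetical placeholders, not part of the original project):

# example_app/models.py -- hypothetical placeholder models
from django.db import models

class Xxxx(models.Model):
    name = models.CharField(max_length=255)  # hypothetical field
    url = models.URLField(unique=True)       # hypothetical field

class Yyyy(models.Model):
    name = models.CharField(max_length=255)  # hypothetical field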

Scrapy configuration

We need to add an empty __init__.py file at django_project/scrapy_project/__init__.py and another at django_project/scrapy_project/spiders/__init__.py so that Python recognizes the scrapy project and its spiders directory as packages.

We will also modify our django_project/scrapy_project/settings.py as follows, so django and scrapy can communicate:

import sys
import os
import django
# ------------ DJANGO SETTINGS ------------
# Make the outer project directory importable, so that
# 'django_project.settings' can be found regardless of the cwd
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
os.environ['DJANGO_SETTINGS_MODULE'] = 'django_project.settings'
django.setup()  # required since django 1.7

BOT_NAME = 'scrapy_project'

SPIDER_MODULES = ['scrapy_project.spiders']
NEWSPIDER_MODULE = 'scrapy_project.spiders'
...
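
Once these settings have run, every module in the scrapy project can use the django ORM. As a quick sanity check (a sketch, assuming the hypothetical Xxxx model from above), you can open a scrapy shell from the scrapy project directory, which loads the settings above, and query the database:

# run `scrapy shell` inside the scrapy project, then:
from example_app.models import Xxxx
Xxxx.objects.count()  # the django ORM is now available to scrapy code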

django_project/scrapy_project/apps.py needs to be created in order to register the scrapy project as a django app:

from django.apps import AppConfig

class ScrapyProjectConfig(AppConfig):
    name = 'scrapy_project'

In django_project/scrapy_project/items.py we create the scrapy items, which inherit from DjangoItem, and we link each one to its django model through the django_model attribute:

import scrapy
from scrapy_djangoitem import DjangoItem
from example_app.models import Xxxx, Yyyy

class XxxxItem(DjangoItem):
    django_model = Xxxx

class YyyyItem(DjangoItem):
    django_model = Yyyy
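
To illustrate what DjangoItem gives us: the item automatically exposes the fields of its django model, and saving it persists a model instance through the django ORM. A minimal sketch (the name field is hypothetical):

item = XxxxItem()
item['name'] = 'some scraped value'  # fields mirror the django model
instance = item.save()  # creates and returns the django model instance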

Now the DjangoItems are ready to be used in the spider. We can also access the django models directly through the django ORM (create, delete, query…). django_project/scrapy_project/spiders/spider.py:

# Scrapy Items imports
from scrapy_project.items import XxxxItem, YyyyItem
# Django model imports
from example_app.models import Xxxx, Yyyy
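
For illustration, a minimal spider built on these imports could look like this (the start URL and the CSS selector are hypothetical):

import scrapy
from scrapy_project.items import XxxxItem

class XxxxSpider(scrapy.Spider):
    name = 'xxxx'
    start_urls = ['https://example.com/listing']  # hypothetical URL

    def parse(self, response):
        item = XxxxItem()
        item['name'] = response.css('h1::text').get()  # hypothetical selector
        yield item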

Usually you are going to use an Item Pipeline for cleaning, checking duplicates, and storing the item in the database (though you could also do it in the spider, depending on your needs). django_project/scrapy_project/pipelines.py:

from scrapy_project.items import XxxxItem, YyyyItem
from example_app.models import Xxxx, Yyyy

class XxxxPipeline(object):
    def process_item(self, item, spider):
        # the cleaning magic starts!
        ...
        # finished!
        item.save()  # persists the item through the django ORM
        return item
...
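
As an example of the duplicate checking mentioned above, a dedicated pipeline could query the django ORM and drop items that are already stored (a sketch, assuming the hypothetical unique url field):

from scrapy.exceptions import DropItem
from example_app.models import Xxxx

class DuplicatesPipeline(object):
    def process_item(self, item, spider):
        # drop the item if a record with the same url already exists
        if Xxxx.objects.filter(url=item['url']).exists():
            raise DropItem('Duplicate item found: %s' % item['url'])
        return item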

With the appropriate imports in place, as shown, you can interact both with the Scrapy DjangoItem objects and with the Django model objects.

The last thing to do is to enable our pipelines in the scrapy settings file, django_project/scrapy_project/settings.py. The number assigned to each pipeline (0-1000) determines the order in which they run: lower values run first.

# Configure item pipelines
ITEM_PIPELINES = {
    'scrapy_project.pipelines.XxxxPipeline': 200,
    ...
}

HAPPY SCRAPING!

I’ll try to write more posts exploring the following further:

Usually you would use the scrapy command line to start the crawler, but if you want to do recurrent scraping of the same domain you would probably have to: