Using Scrapy and Selenium Together: A Step-by-Step Guide
While Scrapy is excellent for static sites, it cannot execute JavaScript. Many modern websites load content dynamically using JavaScript. To scrape these sites, we can integrate Scrapy with Selenium.
When to Use This Integration?
Use this integration when:
The data you need is loaded via JavaScript (AJAX).
You need to interact with the page (click buttons, scroll) to reveal content.
The site uses complex anti-scraping measures that require a real browser fingerprint.
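A quick way to tell whether data is JavaScript-loaded is to compare the raw HTML (what Scrapy's downloader sees) against the rendered DOM (what the browser shows in DevTools). A minimal illustration with made-up sample markup (both HTML strings below are invented for the example):

```python
# Raw HTML as returned by a plain HTTP client such as Scrapy's downloader:
raw_html = '<div id="products"></div><script src="app.js"></script>'

# The same page after the browser has executed app.js:
rendered_html = '<div id="products"><span class="price">9.99</span></div>'

# The price never appears in the raw HTML, so a plain Scrapy request
# cannot extract it -- a sign that Selenium (or similar) is needed.
assert "9.99" not in raw_html
assert "9.99" in rendered_html
```

If your target data shows up only in the rendered version, this integration is the right tool.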
Setup
First, install the necessary packages:
pip install scrapy selenium
You will also need a WebDriver that matches your browser (e.g., ChromeDriver for Chrome). Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically, so a manual driver install is usually only needed on older versions.
Implementation Strategy
The most common integration point is a Downloader Middleware. The middleware intercepts selected requests from Scrapy, loads the page in Selenium, and hands the rendered HTML back to Scrapy as a normal response.
1. Create the Middleware
In your middlewares.py file:
# middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Run in headless mode
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only use Selenium for requests flagged with a specific meta key
        if request.meta.get('selenium'):
            self.driver.get(request.url)
            # You can add waits or interactions here
            # self.driver.implicitly_wait(5)
            body = self.driver.page_source
            # Returning a Response short-circuits Scrapy's default downloader
            return HtmlResponse(
                self.driver.current_url,
                body=body,
                encoding='utf-8',
                request=request,
            )
        # Returning None lets Scrapy download the request normally
        return None

    def spider_closed(self):
        self.driver.quit()
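The key behavior of the middleware is the routing rule: only requests flagged with meta={'selenium': True} go through the browser; everything else falls through to Scrapy's normal downloader. You can sanity-check that logic without launching a browser by stubbing the driver. FakeDriver, FakeRequest, and this standalone process_request are test stubs invented for the sketch, not part of Scrapy or Selenium:

```python
# Minimal sketch: verify the meta-flag routing without a real browser.
class FakeDriver:
    def __init__(self):
        self.visited = []
        self.page_source = "<html><h1>rendered</h1></html>"
        self.current_url = ""

    def get(self, url):
        self.visited.append(url)
        self.current_url = url


class FakeRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}


def process_request(driver, request):
    # Same routing rule as the middleware above
    if request.meta.get('selenium'):
        driver.get(request.url)
        return driver.page_source  # the real middleware wraps this in HtmlResponse
    return None  # fall through to Scrapy's default downloader


driver = FakeDriver()
assert process_request(driver, FakeRequest("https://example.com/js", meta={'selenium': True})) is not None
assert process_request(driver, FakeRequest("https://example.com/static")) is None
assert driver.visited == ["https://example.com/js"]
```

Only the flagged URL reaches the driver; the static one never touches the browser.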
2. Enable the Middleware
In your settings.py, enable the middleware:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}
3. Use it in Your Spider
Now, in your spider, you can pass meta={'selenium': True} to requests that need Selenium:
# spiders/dynamic_spider.py
import scrapy


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/dynamic-content",
            meta={'selenium': True},
            callback=self.parse,
        )

    def parse(self, response):
        # Now response.body contains the HTML rendered by Selenium
        title = response.css("h1::text").get()
        yield {'title': title}
Pros and Cons
Pros: Lets you scrape JavaScript-heavy sites and interact with pages exactly as a real browser would.
Cons: Significantly slower and more resource-hungry than pure Scrapy. Each Selenium-flagged request blocks on a real browser, so you lose the speed benefit of Scrapy's async architecture for those requests.
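Because the middleware holds a single browser instance, Selenium-flagged requests are effectively handled one at a time. If most of your requests go through Selenium, it can help to cap Scrapy's concurrency so requests don't pile up waiting for the driver. The values below are illustrative, not a recommendation:

```python
# settings.py -- optional: reduce concurrency when most requests use Selenium
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 0.5  # be polite; a real browser is already slow
```

Tune both numbers to the ratio of Selenium to plain requests in your crawl.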
Next Steps
In the next article, we will look at how to integrate Scrapy with Playwright, a modern alternative to Selenium.