
Using Scrapy and Selenium Together: A Step-by-Step Guide


While Scrapy is excellent for static sites, it cannot execute JavaScript. Many modern websites load content dynamically using JavaScript. To scrape these sites, we can integrate Scrapy with Selenium.

When to Use This Integration

Use this integration when:

  • The data you need is loaded via JavaScript (AJAX).

  • You need to interact with the page (click buttons, scroll) to reveal content.

  • The site uses complex anti-scraping measures that require a real browser fingerprint.

Setup

First, install the necessary packages:

pip install scrapy selenium

You will also need a WebDriver for your browser (e.g., ChromeDriver). With Selenium 4.6 or later, Selenium Manager downloads a matching driver automatically, so manual driver setup is usually unnecessary.

Implementation Strategy

The most common way to integrate them is to use a Downloader Middleware. This middleware intercepts the request from Scrapy, uses Selenium to load the page, and then returns the HTML content back to Scrapy as a response.
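The contract behind this flow can be sketched in plain Python, with no Scrapy or Selenium installed. The classes below (StubRequest, StubResponse, FakeDriver, SeleniumMiddlewareSketch) are hypothetical stand-ins, not real Scrapy APIs; what they illustrate is the real rule: a middleware's process_request either returns a response (short-circuiting the download) or returns None to let Scrapy's normal downloader handle the request.

```python
# Hypothetical stand-ins to illustrate the middleware contract;
# these are NOT Scrapy classes.
class StubRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class StubResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body

class FakeDriver:
    """Pretends to be a Selenium driver that renders JavaScript."""
    def get(self, url):
        self.current_url = url
        self.page_source = "<h1>rendered by browser</h1>"

class SeleniumMiddlewareSketch:
    def __init__(self):
        self.driver = FakeDriver()

    def process_request(self, request, spider=None):
        # Only requests explicitly flagged in meta go through the browser
        if request.meta.get("selenium"):
            self.driver.get(request.url)
            return StubResponse(self.driver.current_url, self.driver.page_source)
        # Returning None tells Scrapy to continue with its normal download
        return None

mw = SeleniumMiddlewareSketch()
rendered = mw.process_request(StubRequest("https://example.com", {"selenium": True}))
plain = mw.process_request(StubRequest("https://example.com"))
print(rendered.body)  # the "browser"-rendered HTML
print(plain)          # None: handled by Scrapy's own downloader
```

The same selective routing appears in the real middleware below: ordinary requests keep Scrapy's fast async download path, and only flagged requests pay the cost of a browser.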

1. Create the Middleware

In your middlewares.py file:

# middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Run in headless mode
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only use Selenium for requests with a specific meta key
        if request.meta.get('selenium'):
            self.driver.get(request.url)

            # Add waits or interactions here; explicit waits, e.g.
            # WebDriverWait(self.driver, 10).until(...), are usually
            # more reliable than implicitly_wait for dynamic content

            body = self.driver.page_source
            return HtmlResponse(
                self.driver.current_url,
                body=body,
                encoding='utf-8',
                request=request
            )
        return None

    def spider_closed(self):
        self.driver.quit()

2. Enable the Middleware

In your settings.py, enable the middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}
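The value 543 is the middleware's priority: Scrapy sorts DOWNLOADER_MIDDLEWARES by value, and process_request hooks run in ascending order (lower numbers sit closer to the engine). A quick plain-Python sketch of how that ordering resolves, using two of Scrapy's real built-in middlewares at their default priorities (400 and 550) alongside ours:

```python
# Scrapy calls process_request in ascending order of these priority values.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

order = [name for name, prio in sorted(DOWNLOADER_MIDDLEWARES.items(),
                                       key=lambda kv: kv[1])]
for name in order:
    print(name)
```

So with 543, our middleware sees the request after the user-agent header is set but before retry handling, which is a reasonable default.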

3. Use it in Your Spider

Now, in your spider, you can pass meta={'selenium': True} to requests that need Selenium:

# spiders/dynamic_spider.py
import scrapy


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/dynamic-content",
            meta={'selenium': True},
            callback=self.parse
        )

    def parse(self, response):
        # Now response.body contains the HTML rendered by Selenium
        title = response.css("h1::text").get()
        yield {'title': title}

Pros and Cons

  • Pros: Works on JavaScript-heavy sites and lets you interact with pages exactly as a real browser would.

  • Cons: Significantly slower than pure Scrapy. You lose the speed benefit of Scrapy's async architecture for these requests.

Next Steps

In the next article, we will look at how to integrate Scrapy with Playwright, a modern alternative to Selenium.
