Using Scrapy and Selenium Together: A Step-by-Step Guide
While Scrapy is excellent for static sites, it cannot execute JavaScript. Many modern websites load content dynamically using JavaScript. To scrape these sites, we can integrate Scrapy with Selenium.
When to Use This Integration?
Use this integration when:
The data you need is loaded via JavaScript (AJAX).
You need to interact with the page (click buttons, scroll) to reveal content.
The site uses complex anti-scraping measures that require a real browser fingerprint.
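A quick way to tell whether data is JavaScript-loaded is to compare the raw HTML (what Scrapy's downloader sees) against the rendered DOM (what the browser shows in DevTools). A minimal illustration with made-up sample markup (both HTML strings below are invented for the example):

```python
# Raw HTML as returned by a plain HTTP client such as Scrapy's downloader:
raw_html = '<div id="products"></div><script src="app.js"></script>'

# The same page after the browser has executed app.js:
rendered_html = '<div id="products"><span class="price">9.99</span></div>'

# The price never appears in the raw HTML, so a plain Scrapy request
# cannot extract it -- a sign that Selenium (or similar) is needed.
assert "9.99" not in raw_html
assert "9.99" in rendered_html
```

If your target data shows up only in the rendered version, this integration is the right tool.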
Setup
First, install the necessary packages:
pip install scrapy selenium
You will also need a WebDriver that matches your browser (e.g., ChromeDriver for Chrome). Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically, so a manual driver install is usually only needed on older versions.
Implementation Strategy
The most common integration point is a Downloader Middleware. The middleware intercepts selected requests from Scrapy, loads the page in Selenium, and hands the rendered HTML back to Scrapy as a normal response.
1. Create the Middleware
In your middlewares.py file:
# middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Run in headless mode
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only use Selenium for requests flagged with a specific meta key
        if request.meta.get('selenium'):
            self.driver.get(request.url)
            # You can add waits or interactions here
            # self.driver.implicitly_wait(5)
            body = self.driver.page_source
            # Returning a Response short-circuits Scrapy's default downloader
            return HtmlResponse(
                self.driver.current_url,
                body=body,
                encoding='utf-8',
                request=request,
            )
        # Returning None lets Scrapy download the request normally
        return None

    def spider_closed(self):
        self.driver.quit()
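The key behavior of the middleware is the routing rule: only requests flagged with meta={'selenium': True} go through the browser; everything else falls through to Scrapy's normal downloader. You can sanity-check that logic without launching a browser by stubbing the driver. FakeDriver, FakeRequest, and this standalone process_request are test stubs invented for the sketch, not part of Scrapy or Selenium:

```python
# Minimal sketch: verify the meta-flag routing without a real browser.
class FakeDriver:
    def __init__(self):
        self.visited = []
        self.page_source = "<html><h1>rendered</h1></html>"
        self.current_url = ""

    def get(self, url):
        self.visited.append(url)
        self.current_url = url


class FakeRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}


def process_request(driver, request):
    # Same routing rule as the middleware above
    if request.meta.get('selenium'):
        driver.get(request.url)
        return driver.page_source  # the real middleware wraps this in HtmlResponse
    return None  # fall through to Scrapy's default downloader


driver = FakeDriver()
assert process_request(driver, FakeRequest("https://example.com/js", meta={'selenium': True})) is not None
assert process_request(driver, FakeRequest("https://example.com/static")) is None
assert driver.visited == ["https://example.com/js"]
```

Only the flagged URL reaches the driver; the static one never touches the browser.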
2. Enable the Middleware
In your settings.py, enable the middleware:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}
3. Use it in Your Spider
Now, in your spider, you can pass meta={'selenium': True} to requests that need Selenium:
# spiders/dynamic_spider.py
import scrapy


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/dynamic-content",
            meta={'selenium': True},
            callback=self.parse,
        )

    def parse(self, response):
        # Now response.body contains the HTML rendered by Selenium
        title = response.css("h1::text").get()
        yield {'title': title}
Pros and Cons
Pros: Lets you scrape JavaScript-heavy sites and interact with pages exactly as a real browser would.
Cons: Significantly slower and more resource-hungry than pure Scrapy. Each Selenium-flagged request blocks on a real browser, so you lose the speed benefit of Scrapy's async architecture for those requests.
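Because the middleware holds a single browser instance, Selenium-flagged requests are effectively handled one at a time. If most of your requests go through Selenium, it can help to cap Scrapy's concurrency so requests don't pile up waiting for the driver. The values below are illustrative, not a recommendation:

```python
# settings.py -- optional: reduce concurrency when most requests use Selenium
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 0.5  # be polite; a real browser is already slow
```

Tune both numbers to the ratio of Selenium to plain requests in your crawl.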
Next Steps
In the next article, we will look at how to integrate Scrapy with Playwright, a modern alternative to Selenium.