Using Scrapy and Selenium Together: A Step-by-Step Guide
I'm Ravikirana B, an engineer driven by curiosity and clarity. My work sits at the intersection of hardware and software. I specialize in Python programming and electronics, building real-world solutions that don't just work: they make sense. I started 'Tech Priya' with a simple mission: to share the joy of technology. "Priya" means dear or beloved, and this platform is dedicated to everyone who loves to understand the "why" and "how" behind the machines we use every day. What you'll find here:
Electronics Simplified: Complex circuits explained with relatable analogies (think water tanks, gates, and traffic flows).
Python in Practice: Automation ideas, coding insights, and tool development.
Real Reflections: Honest takes on tech, bridging the gap between textbook theory and hands-on reality.
Native Connection: Tech concepts explained with a Kannada-English touch to make learning feel like home.
I believe technology shouldn't be a barrier. Whether you are a student from a small town or a self-learner with big dreams, Tech Priya is here to make the complex simple. Let's keep exploring: clearly, curiously, and together.
While Scrapy is excellent for static sites, it cannot execute JavaScript. Many modern websites load content dynamically using JavaScript. To scrape these sites, we can integrate Scrapy with Selenium.
When to Use This Integration?
Use this integration when:
The data you need is loaded via JavaScript (AJAX).
You need to interact with the page (click buttons, scroll) to reveal content.
The site's anti-scraping measures expect requests to come from a real browser. (Headless Chrome can still be detected by some of them, so this is not a guaranteed bypass.)
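A quick way to test the first condition is to look at the raw HTML the server returns, before any JavaScript runs. The sketch below uses only the standard library. The helper name present_in_raw_html, the product class, and the two sample pages are made up for illustration. If the element you want is missing from the raw HTML, the page is probably filling it in with JavaScript.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Records whether a tag with a given CSS class appears in the HTML."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.found = True


def present_in_raw_html(html, tag, css_class):
    """Return True if the element exists before any JavaScript has run."""
    finder = TagFinder(tag, css_class)
    finder.feed(html)
    return finder.found


# Hypothetical responses: one server-rendered, one filled in by JavaScript
static_page = '<div class="product">Lamp</div>'
dynamic_page = '<div id="app"></div><script src="/bundle.js"></script>'

print(present_in_raw_html(static_page, "div", "product"))   # True
print(present_in_raw_html(dynamic_page, "div", "product"))  # False
```

When the check returns False for a page that shows the data in your browser, that page is a candidate for the Selenium route described in this guide.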
Setup
First, install the necessary packages:
pip install scrapy selenium
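You will also need a WebDriver for your browser (e.g., ChromeDriver). Selenium 4.6 and later can download a matching driver automatically through Selenium Manager. On older versions, download the driver yourself and make sure it is on your PATH.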
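To confirm that both packages landed in the environment you are actually using, a small check like this can help. It is only a convenience sketch, and the helper name installed_version is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("scrapy", "selenium"):
    print(name, installed_version(name) or "not installed")
```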
Implementation Strategy
The most common way to integrate them is to use a Downloader Middleware. This middleware intercepts the request from Scrapy, uses Selenium to load the page, and then returns the HTML content back to Scrapy as a response.
1. Create the Middleware
In your middlewares.py file:
# middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Run Chrome without a visible window
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only use Selenium for requests that set meta={'selenium': True}
        if request.meta.get('selenium'):
            self.driver.get(request.url)
            # Add explicit waits or interactions here if content appears after
            # the initial page load (implicitly_wait only affects element lookups)
            body = self.driver.page_source
            # Returning a response here makes Scrapy skip its own download
            return HtmlResponse(
                self.driver.current_url,
                body=body,
                encoding='utf-8',
                request=request,
            )
        return None  # Let Scrapy download the request as usual

    def spider_closed(self, spider):
        self.driver.quit()
2. Enable the Middleware
In your settings.py, enable the middleware:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Replace 'myproject' with the name of your Scrapy project package
    'myproject.middlewares.SeleniumMiddleware': 543,
}
3. Use it in Your Spider
Now, in your spider, you can pass meta={'selenium': True} to requests that need Selenium:
# spiders/dynamic_spider.py
import scrapy


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/dynamic-content",
            meta={'selenium': True},
            callback=self.parse,
        )

    def parse(self, response):
        # Now response.body contains the HTML rendered by Selenium
        title = response.css("h1::text").get()
        yield {'title': title}
Pros and Cons
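Pros: Lets Scrapy handle pages whose content only appears after JavaScript runs, while you keep the usual spider, selector, and pipeline code.
Cons: Significantly slower than pure Scrapy. Every Selenium request loads a full page in a real browser, and because driver.get() blocks inside the middleware, Scrapy's asynchronous engine cannot work on other requests while the browser is busy.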
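To get a feel for the cost, here is a rough back-of-the-envelope estimate. The page count and per-page timings are invented purely for illustration. The concurrency of 16 matches Scrapy's default CONCURRENT_REQUESTS, and the single shared browser in the middleware above behaves like a concurrency of 1.

```python
def crawl_seconds(pages, seconds_per_page, concurrency=1):
    """Rough wall-clock estimate: pages are fetched in batches of `concurrency`."""
    batches = -(-pages // concurrency)  # Ceiling division
    return batches * seconds_per_page


# Hypothetical numbers, chosen only to illustrate the ratio
selenium_time = crawl_seconds(pages=1000, seconds_per_page=3.0, concurrency=1)
scrapy_time = crawl_seconds(pages=1000, seconds_per_page=0.5, concurrency=16)

print(selenium_time)  # 3000.0
print(scrapy_time)    # 31.5
```

Real numbers depend heavily on the site and your network, so measure before drawing conclusions.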
Next Steps
In the next article, we will look at how to integrate Scrapy with Playwright, a modern alternative to Selenium.