Step-by-Step Guide to Using Scrapy with Playwright
Playwright is a newer browser automation tool than Selenium, and it is generally faster and less prone to flaky waits. For modern, JavaScript-heavy sites, integrating it with Scrapy is often the preferred way to render pages before scraping them.
Why Playwright?
Faster: Generally faster execution than Selenium.
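Better Waiting: Actions such as click and fill wait automatically until the target element is ready (for example, visible and enabled), so fewer explicit sleeps are needed.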
Modern Web Support: Better handling of modern web features such as JavaScript-rendered single-page applications.
Setup
We will use the scrapy-playwright plugin, which routes Scrapy requests through a Playwright-controlled browser by means of a custom download handler.
Install the package, then download the browser binaries that Playwright drives. These are two separate commands:
pip install scrapy-playwright
playwright install
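Before moving on, it can help to confirm that the packages are importable from the environment Scrapy will run in. The sketch below is one way to check, using only the standard library. The helper name missing_packages is hypothetical, and the check only covers the Python modules, not the browser binaries that playwright install downloads.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names differ from the pip names: scrapy-playwright -> scrapy_playwright.
    missing = missing_packages(["scrapy", "scrapy_playwright", "playwright"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, rerun the install commands inside the same virtual environment that you use to run scrapy crawl.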
Configuration
Update your settings.py to enable the scrapy-playwright download handler and to switch Twisted to the asyncio reactor, which the plugin requires:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
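Beyond the two required settings, the plugin reads a few optional ones that control which browser is launched and how long navigation may take. The fragment below is a sketch of what such additions might look like; the setting names come from the scrapy-playwright documentation as we understand it, and the values are illustrative assumptions rather than recommendations.

```python
# settings.py (optional additions; all values here are illustrative assumptions)

# Which Playwright browser to launch: "chromium", "firefox", or "webkit".
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Keyword arguments passed to the browser launch call.
# Set headless to False to watch the browser while debugging.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Default navigation timeout for pages, in milliseconds.
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000
```

Leaving these out is fine for a first run, since the plugin falls back to its own defaults.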
Using Playwright in Your Spider
To use Playwright for a request, pass meta={"playwright": True}. Requests without this key are still handled by Scrapy's regular downloader.
# spiders/playwright_spider.py
import scrapy


class PlaywrightSpider(scrapy.Spider):
    name = "playwright_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/dynamic",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # The response is now the rendered HTML from Playwright
        yield {
            "text": response.css("div.content::text").get()
        }
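Because the playwright flag is set per request, a spider can render only the pages that need a browser and fetch everything else the usual, cheaper way. A minimal sketch of that idea follows; the helper request_meta and the path prefixes in DYNAMIC_PREFIXES are hypothetical and would need to match your target site.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes that are known to need JavaScript rendering.
DYNAMIC_PREFIXES = ("/dynamic", "/app")


def request_meta(url, dynamic_prefixes=DYNAMIC_PREFIXES):
    """Return the meta dict for a request, enabling Playwright only where needed."""
    path = urlparse(url).path
    if path.startswith(tuple(dynamic_prefixes)):
        return {"playwright": True}
    # An empty meta dict leaves the request to Scrapy's regular downloader.
    return {}
```

In a spider this could be used as scrapy.Request(url, meta=request_meta(url), callback=self.parse), which keeps the rendering decision in one place.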
Advanced Usage: Page Interactions
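You can also interact with the page by passing a list of PageMethod objects under the playwright_page_methods meta key. The methods run in order on the page before the response is returned to your callback.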
from scrapy_playwright.page import PageMethod

# Method of your spider class (class definition omitted)
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/login",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("fill", "input[name='user']", "myuser"),
                PageMethod("fill", "input[name='pass']", "mypass"),
                PageMethod("click", "button[type='submit']"),
                PageMethod("wait_for_selector", "div.dashboard"),
            ],
        },
        callback=self.parse_dashboard,
    )
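Hardcoding credentials in a spider is rarely a good idea, so one option is to build the page methods from environment variables. The sketch below is an assumption-laden example, not part of the plugin: login_steps and login_page_methods are hypothetical helpers, SCRAPER_USER and SCRAPER_PASS are hypothetical variable names, and the selectors are the ones from the login flow above.

```python
import os


def login_steps(user, password):
    """Describe the login flow as (method_name, *args) tuples, in execution order."""
    return [
        ("fill", "input[name='user']", user),
        ("fill", "input[name='pass']", password),
        ("click", "button[type='submit']"),
        ("wait_for_selector", "div.dashboard"),
    ]


def login_page_methods(env=os.environ):
    """Build PageMethod objects from credentials stored in environment variables."""
    # Imported lazily so login_steps can be used without the plugin installed.
    from scrapy_playwright.page import PageMethod

    # SCRAPER_USER and SCRAPER_PASS are hypothetical variable names.
    steps = login_steps(env["SCRAPER_USER"], env["SCRAPER_PASS"])
    return [PageMethod(*step) for step in steps]
```

The request meta then becomes {"playwright": True, "playwright_page_methods": login_page_methods()}, and the credentials stay out of version control.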
Comparison with Selenium Integration
| Feature | Scrapy + Selenium | Scrapy + Playwright |
| --- | --- | --- |
| Setup | Manual Middleware | Plugin (scrapy-playwright) |
| Speed | Slower | Faster |
| Ease of Use | Moderate | Easy (with plugin) |
| Reliability | Good | Excellent |
Conclusion
For new projects that require JavaScript rendering, Scrapy with Playwright is a strong default because of its performance and the ease of integration through the scrapy-playwright plugin.
Next Steps
In the next article, we will discuss how to debug Scrapy spiders effectively.