How to Avoid Bot Detection Using Scrapy and Playwright
I'm Ravikirana B, an engineer driven by curiosity and clarity. My work sits at the intersection of hardware and software. I specialize in Python programming and electronics, building real-world solutions that don't just work; they make sense. I started 'Tech Priya' with a simple mission: to share the joy of technology. "Priya" means dear or beloved, and this platform is dedicated to everyone who loves to understand the "why" and "how" behind the machines we use every day.
What you'll find here:
Electronics Simplified: Complex circuits explained with relatable analogies (think water tanks, gates, and traffic flows).
Python in Practice: Automation ideas, coding insights, and tool development.
Real Reflections: Honest takes on tech, bridging the gap between textbook theory and hands-on reality.
Native Connection: Tech concepts explained with a Kannada-English touch to make learning feel like home.
I believe technology shouldn't be a barrier. Whether you are a student from a small town or a self-learner with big dreams, Tech Priya is here to make the complex simple. Let's keep exploring: clearly, curiously, and together.
When pure Scrapy isn't enough (when the website checks for a real browser, executes complex JavaScript, or has advanced anti-bot protection), it's time to bring in the heavy artillery: Scrapy + Playwright.
This guide shows you how to configure them together for better stealth, so that your scraper looks much more like a real user browsing in Chrome.
1. Why Playwright?
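Pure Scrapy is just an HTTP client. It doesn't have a screen, a mouse, or a JavaScript engine. Playwright is a library that drives a real browser (Chromium, Firefox, or WebKit), so pages render and run JavaScript just as they do for a human visitor. That gets you past checks that only ask "can this client run JavaScript?", but an automated browser still exposes a few tell-tale signs by default, which is what the stealth settings in section 4 address.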
2. Installation
First, you need to install the integration plugin and the browsers.
Run these commands in your terminal:
pip install scrapy-playwright
playwright install chromium
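Before moving on, a quick check that the packages are importable can save a confusing traceback later. The sketch below uses only the standard library. The helper name missing_packages is made up for this example, and it only checks that the Python modules are present, not that the Chromium binary was downloaded.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names differ from the pip name: scrapy-playwright -> scrapy_playwright
required = ["scrapy", "playwright", "scrapy_playwright"]
missing = missing_packages(required)
print("Missing:", missing if missing else "nothing, all set")
```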
3. Basic Configuration
You need to tell Scrapy to use Playwright for downloading pages instead of its default downloader.
Open settings.py and add/update these lines:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# This is required for Playwright to work with Scrapy
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
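A misspelled key here can fail quietly: Scrapy falls back to its default downloader and the browser never starts. As a rough sketch, the helper below takes a plain dict of settings and reports which of the entries from this section are missing or different. The function name is hypothetical; it is not part of Scrapy or scrapy-playwright.

```python
HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_setting_problems(settings):
    """Return human-readable problems found in a dict of Scrapy settings."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS", {})
    for scheme in ("http", "https"):
        if handlers.get(scheme) != HANDLER:
            problems.append(
                f"DOWNLOAD_HANDLERS[{scheme!r}] is not the Playwright handler"
            )
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems


good = {
    "DOWNLOAD_HANDLERS": {"http": HANDLER, "https": HANDLER},
    "TWISTED_REACTOR": REACTOR,
}
print(playwright_setting_problems(good))  # prints: []
# Handlers missing entirely: two problems are reported
print(playwright_setting_problems({"TWISTED_REACTOR": REACTOR}))
```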
4. The "Stealth" Configuration (The Secret Sauce)
Just using Playwright isn't always enough. Sophisticated sites check for "automation flags" (variables that say "Hey, I'm being controlled by a script"). We need to disable them.
Add this to your settings.py:
# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,  # Set to False to see the browser pop up (good for debugging)
    "args": [
        "--disable-blink-features=AutomationControlled",  # <--- THE KEY to stealth
        "--no-sandbox",  # Usually only needed in Docker or when running as root
    ],
}
# Current scrapy-playwright versions read context options from PLAYWRIGHT_CONTEXTS,
# keyed by context name. Keys use Playwright's Python (snake_case) argument names.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "java_script_enabled": True,
        "ignore_https_errors": True,
        # Set a real browser viewport size
        "viewport": {"width": 1920, "height": 1080},
        # Set a real User-Agent (very important!)
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
}
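--disable-blink-features=AutomationControlled: This stops Chromium from setting navigator.webdriver to true, the "I am a robot" flag that page scripts can read when the browser is controlled by code.
user_agent: We manually set a modern Chrome user agent. Keep its version reasonably close to the Chromium build that Playwright installed, since a mismatch can itself look suspicious.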
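To see what the flag changes, one option is to ask the browser directly. The sketch below uses Playwright's synchronous API outside Scrapy to read navigator.webdriver with and without the stealth argument. The function names are illustrative, and the value reported can vary between Chromium versions, so no output is shown.

```python
STEALTH_ARG = "--disable-blink-features=AutomationControlled"


def build_launch_options(stealth=True, headless=True):
    """Build keyword arguments for chromium.launch(), optionally with the stealth flag."""
    args = [STEALTH_ARG] if stealth else []
    return {"headless": headless, "args": args}


def read_webdriver_flag(launch_options):
    """Launch Chromium and return navigator.webdriver as a page script sees it."""
    # Imported lazily so build_launch_options() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_options)
        try:
            page = browser.new_page()
            return page.evaluate("() => navigator.webdriver")
        finally:
            browser.close()


def compare_flags():
    """Print navigator.webdriver for a plain launch and for a stealth launch."""
    for stealth in (False, True):
        value = read_webdriver_flag(build_launch_options(stealth=stealth))
        print(f"stealth={stealth}: navigator.webdriver = {value!r}")
```

Calling compare_flags() from a Python shell runs both launches. It needs the Chromium build installed in section 2.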
5. How to Use It in Your Spider
Now that settings are configured, you need to tell your spider to use Playwright for specific requests.
In your spider file (e.g., spiders/myspider.py):
import scrapy


class StealthSpider(scrapy.Spider):
    name = "stealth"

    def start_requests(self):
        yield scrapy.Request(
            url="https://nowsecure.nl",  # A site to test security
            meta={
                "playwright": True,
                "playwright_include_page": True,  # Optional: if you need to interact with the page
            },
            callback=self.parse,
            errback=self.close_page,
        )

    async def parse(self, response):
        # If you need to interact (click/scroll), you get the 'page' object
        page = response.meta["playwright_page"]
        # Extract data normally
        title = response.css("title::text").get()
        self.logger.info("Title: %s", title)
        # Always close the page, or it stays open and holds memory
        await page.close()

    async def close_page(self, failure):
        # Close the page when the request fails too
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
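The page object is what makes interaction possible. As an illustration, here is a minimal sketch of a scrolling helper for pages that load content lazily. It relies on the Playwright page calls page.mouse.wheel(), page.wait_for_timeout(), and page.content(). The helper name, the step count, and the pause length are assumptions to tune per site.

```python
async def scroll_and_get_html(page, steps=3, pixels=2000, pause_ms=500):
    """Scroll down in several steps, pausing after each, then return the page HTML."""
    for _ in range(steps):
        await page.mouse.wheel(0, pixels)      # scroll down by `pixels`
        await page.wait_for_timeout(pause_ms)  # give lazy content time to load
    return await page.content()
```

Inside the async parse method, this could be called as html = await scroll_and_get_html(response.meta["playwright_page"]) before closing the page, and the result wrapped in scrapy.Selector(text=html) for extraction.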
6. Advanced Stealth: Randomizing User-Agents
Using the same User-Agent for every request is suspicious. Let's randomize it for every request.
Update your Spider to pass context arguments dynamically:
import random

import scrapy

# List of real User-Agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]


class RandomStealthSpider(scrapy.Spider):
    name = "random_stealth"

    def start_requests(self):
        # Context options only apply when a context is created, so each
        # User-Agent gets its own named context.
        index = random.randrange(len(USER_AGENTS))
        yield scrapy.Request(
            url="https://bot.sannysoft.com",  # A bot detection test site
            meta={
                "playwright": True,
                "playwright_context": f"ua-{index}",
                "playwright_context_kwargs": {
                    "user_agent": USER_AGENTS[index],
                    "viewport": {"width": 1920, "height": 1080},
                },
            },
            callback=self.parse,
        )

    def parse(self, response):
        # ... extraction logic
        pass
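One detail worth spelling out: scrapy-playwright applies context arguments only when it creates a context, so a User-Agent takes effect only if the request names a context that does not exist yet. The sketch below builds the request meta for a randomly chosen User-Agent with the playwright_context and playwright_context_kwargs meta keys, so it can be reused for every request a spider yields. The helper name, the ua-<index> naming scheme, and the example.com URLs are assumptions for illustration.

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]


def random_ua_meta(user_agents, rng=random):
    """Build Request.meta that routes the request through a per-User-Agent context."""
    index = rng.randrange(len(user_agents))
    return {
        "playwright": True,
        # One named context per User-Agent; it is created on first use and then reused
        "playwright_context": f"ua-{index}",
        "playwright_context_kwargs": {"user_agent": user_agents[index]},
    }


# Hypothetical URLs, only to show one random pick per request
for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    meta = random_ua_meta(USER_AGENTS)
    print(url, "->", meta["playwright_context"])
```

Because the context name is derived from the list index, the number of open contexts never exceeds the number of User-Agents in the list.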
7. Complete settings.py for Copy-Paste
Here is the full configuration block for settings.py.
# settings.py
# 1. Enable Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# 2. Launch Options (The Browser App)
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,  # Set False to watch it work
    "args": [
        "--disable-blink-features=AutomationControlled",  # Hides the 'robot' flag
        "--no-sandbox",  # Usually only needed in Docker or when running as root
    ],
}
# 3. Context Options (The Browser Profile), keyed by context name
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "java_script_enabled": True,
        "ignore_https_errors": True,
        "viewport": {"width": 1920, "height": 1080},
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
}
# 4. Standard Scrapy Politeness (Still applies!)
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4
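In practice the delay is not a fixed 2 seconds. With Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting, which is on by default, the wait between requests to the same site is drawn from 0.5 to 1.5 times DOWNLOAD_DELAY. A small worked example of that arithmetic (the function name is illustrative):

```python
def delay_range(download_delay):
    """Return the (min, max) wait in seconds under RANDOMIZE_DOWNLOAD_DELAY."""
    return (0.5 * download_delay, 1.5 * download_delay)


low, high = delay_range(2)
print(f"Each request waits between {low} and {high} seconds")
# prints: Each request waits between 1.0 and 3.0 seconds
```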
Summary
Install: scrapy-playwright.
Configure: DOWNLOAD_HANDLERS and TWISTED_REACTOR.
Add Stealth Args: --disable-blink-features=AutomationControlled is the most important line.
Use Meta: Pass meta={"playwright": True} in your requests.
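With this setup, you are running a real Chromium browser that hides its most obvious automation flag. That is enough to get past many basic bot checks, but it is not a guarantee: more advanced systems also look at browser fingerprints, IP reputation, and behavior.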