How to Avoid Bot Detection Using Scrapy and Playwright


I’m Ravikirana B – an engineer driven by curiosity and clarity. My work sits at the intersection of hardware and software. I specialize in Python programming and electronics, building real-world solutions that don’t just work—they make sense. I started 'Tech Priya' with a simple mission: to share the joy of technology. "Priya" means dear or beloved, and this platform is dedicated to everyone who loves to understand the "why" and "how" behind the machines we use every day. What you’ll find here: 🔌 Electronics Simplified: Complex circuits explained with relatable analogies (think water tanks, gates, and traffic flows). 🐍 Python in Practice: Automation ideas, coding insights, and tool development. 💡 Real Reflections: Honest takes on tech, bridging the gap between textbook theory and hands-on reality. 🌿 Native Connection: Tech concepts explained with a Kannada-English touch to make learning feel like home. I believe technology shouldn't be a barrier. Whether you are a student from a small town or a self-learner with big dreams, Tech Priya is here to make the complex simple. Let’s keep exploring—clearly, curiously, and together. 🙌

When pure Scrapy isn't enough—when the website checks for a real browser, executes complex JavaScript, or has advanced anti-bot protection—it's time to bring in the heavy artillery: Scrapy + Playwright.

This guide shows you how to configure them together so that your scraper looks much more like a real user browsing Chrome.

1. Why Playwright?

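Pure Scrapy is an HTTP client: it downloads the raw HTML, but it has no screen, no mouse, and no JavaScript engine. Playwright is an automation library that drives a real browser (Chromium, Firefox, or WebKit), so pages are rendered and their scripts run just as they do for a human visitor. That alone gets you past many basic "Are you a robot?" checks, although more sophisticated systems look for further signals, which the stealth settings in section 4 address.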
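
To make the "no JavaScript engine" point concrete, here is a minimal sketch using only the Python standard library. The HTML snippet and the `price` element are made up for illustration; the idea is that a plain HTML parser only sees what the server sent, not what a script would have filled in later.

```python
from html.parser import HTMLParser

# Hypothetical page: the price is filled in by JavaScript after the page loads.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "42";</script>
</body></html>
"""


class PriceExtractor(HTMLParser):
    """Collects the text inside <div id="price">; script code is never executed."""

    def __init__(self):
        super().__init__()
        self.inside_price = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "price":
            self.inside_price = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_price = False

    def handle_data(self, data):
        if self.inside_price:
            self.text += data


def price_without_javascript(html: str) -> str:
    """Return the price text as a non-browser client would see it."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.text.strip()


# The div is empty in the raw HTML, so nothing is extracted.
print(repr(price_without_javascript(RAW_HTML)))
```

A real browser would run the script and show the value, which is the gap Playwright closes.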


2. Installation

First, you need to install the integration plugin and the browsers.

Run these commands in your terminal:

pip install scrapy-playwright
playwright install chromium

3. Basic Configuration

You need to tell Scrapy to use Playwright for downloading pages instead of its default downloader.

Open settings.py and add/update these lines:

# settings.py

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# This is required for Playwright to work with Scrapy
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
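
If requests still seem to go through the default downloader, it can help to double-check the two settings above. The following is a small sketch of such a check; `find_playwright_setting_problems` is a hypothetical helper written for this post (not part of Scrapy or scrapy-playwright), and it works on a plain dict rather than a real Scrapy settings object.

```python
REQUIRED_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
REQUIRED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def find_playwright_setting_problems(settings: dict) -> list[str]:
    """Return human-readable problems; an empty list means the basics are in place."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS", {})
    for scheme in ("http", "https"):
        if handlers.get(scheme) != REQUIRED_HANDLER:
            problems.append(f"DOWNLOAD_HANDLERS[{scheme!r}] is not the Playwright handler")
    if settings.get("TWISTED_REACTOR") != REQUIRED_REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems


# Example: the https handler was forgotten.
incomplete = {
    "DOWNLOAD_HANDLERS": {"http": REQUIRED_HANDLER},
    "TWISTED_REACTOR": REQUIRED_REACTOR,
}
for problem in find_playwright_setting_problems(incomplete):
    print(problem)
```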

4. The "Stealth" Configuration (The Secret Sauce)

Just using Playwright isn't always enough. Sophisticated sites check for "automation flags" (variables that say "Hey, I'm being controlled by a script"). We need to disable them.

Add this to your settings.py:

# settings.py

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,  # Set to False to see the browser pop up (good for debugging)
    "args": [
        "--disable-blink-features=AutomationControlled", # <--- THE KEY to stealth
        "--no-sandbox",  # Not a stealth flag: helps in some containers, but weakens browser isolation
    ],
}

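# Named browser contexts; "default" is used when a request does not name one.
# The keys are the Python keyword arguments of Playwright's Browser.new_context().
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "java_script_enabled": True,
        "ignore_https_errors": True,
        # Set a real browser viewport size
        "viewport": {"width": 1920, "height": 1080},
        # Set a real User-Agent (very important!)
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
}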
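  • --disable-blink-features=AutomationControlled: This stops Chrome from reporting navigator.webdriver as true, the flag that pages read to see whether the browser is controlled by code.

  • user_agent: We manually set a modern Chrome user agent, because headless Chromium can otherwise identify itself as "HeadlessChrome". Keep the version reasonably current, since an outdated Chrome version is a signal in itself.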
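
To see why the User-Agent matters, here is a deliberately naive sketch of the kind of check a site could run on incoming requests. The function name and marker list are invented for this example; real detection systems combine many more signals than a string match.

```python
# Substrings that commonly appear in the User-Agent of automated clients.
AUTOMATION_MARKERS = ("headlesschrome", "scrapy", "python-requests", "curl")


def user_agent_looks_automated(user_agent: str) -> bool:
    """Toy server-side check: flag User-Agents containing a known marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in AUTOMATION_MARKERS)


REAL_CHROME = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
HEADLESS_CHROME = REAL_CHROME.replace("Chrome/", "HeadlessChrome/")

print(user_agent_looks_automated(REAL_CHROME))      # False
print(user_agent_looks_automated(HEADLESS_CHROME))  # True
```

Setting user_agent in the context options is what moves a scraper from the second case to the first.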


5. How to Use It in Your Spider

Now that settings are configured, you need to tell your spider to use Playwright for specific requests.

In your spider file (e.g., spiders/myspider.py):

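import scrapy

class StealthSpider(scrapy.Spider):
    name = "stealth"

    def start_requests(self):
        yield scrapy.Request(
            url="https://nowsecure.nl",  # A site to test security
            meta={
                "playwright": True,
                # Optional: only needed if you want to interact with the page
                "playwright_include_page": True,
            },
            callback=self.parse,
            errback=self.close_page,
        )

    async def parse(self, response):
        # With "playwright_include_page", the Playwright page arrives in the meta
        page = response.meta["playwright_page"]

        # Extract data normally
        title = response.css("title::text").get()
        self.logger.info("Title: %s", title)

        # Interact here if needed (click/scroll), then always close the page
        await page.close()

    async def close_page(self, failure):
        # Close the page for failed requests too, so it does not stay open
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()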
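
When you request the page object, it is your job to close it, even if your extraction code raises an error. One way to guarantee that is a try/finally block. The sketch below shows the pattern with a stand-in `FakePage` class (invented here so the example runs without a browser); in a spider, the object would be the real Playwright page from response.meta["playwright_page"].

```python
import asyncio


class FakePage:
    """Stand-in for a Playwright page, so the pattern can run without a browser."""

    def __init__(self):
        self.closed = False

    async def title(self):
        return "Example"

    async def close(self):
        self.closed = True


async def extract_title(page, fail=False):
    """Read the title and close the page, whether or not extraction succeeds."""
    try:
        if fail:
            raise ValueError("extraction failed")
        return await page.title()
    finally:
        # Runs on success and on error, so the page never stays open
        await page.close()


if __name__ == "__main__":
    demo_page = FakePage()
    print(asyncio.run(extract_title(demo_page)), demo_page.closed)
```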

6. Advanced Stealth: Randomizing User-Agents

Using the same User-Agent for every request is suspicious. Let's randomize it for every request.

Update your Spider to pass context arguments dynamically:

import random

import scrapy

# List of real User-Agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]

class RandomStealthSpider(scrapy.Spider):
    name = "random_stealth"

    def start_requests(self):
        # Pick one User-Agent for this request
        index = random.randrange(len(USER_AGENTS))
        yield scrapy.Request(
            url="https://bot.sannysoft.com", # A bot detection test site
            meta={
                "playwright": True,
                # Context options only apply when a context is created,
                # so each User-Agent gets its own named context
                "playwright_context": f"ua-{index}",
                "playwright_context_kwargs": {
                    "user_agent": USER_AGENTS[index],
                    "viewport": {"width": 1920, "height": 1080},
                },
            },
            callback=self.parse,
        )

    def parse(self, response):
        # ... extraction logic
        pass
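
The spider above sends a single request. To vary the User-Agent across many URLs, the choice has to be made once per request. Here is a minimal sketch of a helper that does this; `meta_for_urls` is a hypothetical name, and it relies on the playwright_context and playwright_context_kwargs request meta keys of current scrapy-playwright releases. The shortened User-Agent list is only there to keep the example compact.

```python
import random
from typing import Iterator

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]


def meta_for_urls(urls: list[str], rng: random.Random) -> Iterator[tuple[str, dict]]:
    """Yield (url, meta) pairs, choosing a User-Agent separately for each URL."""
    for url in urls:
        index = rng.randrange(len(USER_AGENTS))
        yield url, {
            "playwright": True,
            # One named context per User-Agent, created on first use
            "playwright_context": f"ua-{index}",
            "playwright_context_kwargs": {"user_agent": USER_AGENTS[index]},
        }


for url, meta in meta_for_urls(["https://example.com/a", "https://example.com/b"], random.Random()):
    print(url, meta["playwright_context"])
```

Inside start_requests, you would loop over these pairs and yield scrapy.Request(url, meta=meta, callback=self.parse) for each one.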

7. Complete settings.py for Copy-Paste

Here is the full configuration block for settings.py.

# settings.py

# 1. Enable Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# 2. Launch Options (The Browser App)
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True, # Set False to watch it work
    "args": [
        "--disable-blink-features=AutomationControlled", # Hides the 'robot' flag
        "--no-sandbox", # Not a stealth flag; weakens browser isolation
    ],
}

# 3. Context Options (an isolated browser session, like an incognito window)
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "java_script_enabled": True,
        "ignore_https_errors": True,
        "viewport": {"width": 1920, "height": 1080},
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
}

# 4. Standard Scrapy Politeness (Still applies!)
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4

Summary

  1. Install scrapy-playwright.

  2. Configure DOWNLOAD_HANDLERS and TWISTED_REACTOR.

  3. Add Stealth Args: --disable-blink-features=AutomationControlled is the most important line.

  4. Use Meta: Pass meta={"playwright": True} in your requests.

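With this setup, you are running a real Chromium browser with its most obvious automation flag hidden. That is enough to get past many bot detection systems, but not all of them: advanced services also look at signals such as IP reputation and browsing behavior, so always test against the sites you actually target.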
