Stealth Scraping with Scrapy Tips

Before you reach for heavy tools like Playwright or expensive proxies, you can do a lot to avoid detection using plain Scrapy. This guide covers the main techniques for making a standard Scrapy spider look more like a real browser user.

1. The Golden Rule: Don't Act Like a Robot

Robots are fast, precise, and repetitive. Humans are slow, random, and messy. To avoid detection, your spider must mimic human behavior.

2. User-Agent Rotation (The Basics)

The User-Agent header tells the server what browser you are using. By default, Scrapy identifies itself as "Scrapy/VERSION (+https://scrapy.org)", which many sites block on sight.

Solution: Rotate through a list of real browser User-Agents.

Step-by-Step Implementation:

Install the library: Open your terminal and run:
```
pip install scrapy-user-agents
```

Edit settings.py: Open the settings.py file in your project folder. Find the DOWNLOADER_MIDDLEWARES section (or create it if it doesn't exist) and paste this:

# settings.py

DOWNLOADER_MIDDLEWARES = {
    # Disable the default UserAgent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random UserAgent middleware
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
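If you would rather not add a dependency, the same idea can be sketched as a small downloader middleware of your own. This is a minimal sketch, not how scrapy-user-agents works internally: the class name `RotatingUserAgentMiddleware` and the `USER_AGENTS` list are made up for illustration, and the User-Agent strings are examples that you should replace with current ones.

```python
import random

# Example User-Agent strings (assumption: swap in current, real browser values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RotatingUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider=None):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request as usual.
        return None
```

To try it, you would save the class in your project's middlewares.py and register it in DOWNLOADER_MIDDLEWARES in place of the scrapy_user_agents entry, for example under the (hypothetical) path 'myproject.middlewares.RotatingUserAgentMiddleware' with priority 400.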

3. Headers: The "Fingerprint" of a Browser

Browsers send a specific set of headers with every request. If you only send a User-Agent, it looks suspicious.

Solution: Copy the full headers from a real browser request.

How to get them:

Open Chrome DevTools (F12) and go to the Network tab.
Refresh the page.
Right-click the main document request -> Copy -> Copy as cURL (bash).
Use a tool (like curlconverter.com) to convert it to a Python dictionary.
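If you prefer not to paste requests into an online tool, the conversion can be roughly approximated with the standard library. The sketch below assumes the usual shape of Chrome's "Copy as cURL (bash)" output, where each header is a quoted `-H 'Name: value'` argument and lines end with a backslash. The function name `headers_from_curl` is hypothetical, and it ignores everything except headers.

```python
import shlex


def headers_from_curl(curl_command):
    """Collect -H/--header values from a cURL command string into a dict."""
    # Join backslash-continued lines so shlex sees one command line.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    headers = {}
    # Walk adjacent token pairs: a header flag followed by its 'Name: value'.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# Shortened, made-up sample of what the clipboard might contain.
sample = """curl 'https://example.com/' \\
  -H 'Accept: text/html' \\
  -H 'Accept-Language: en-US,en;q=0.5'"""

print(headers_from_curl(sample))
```

The resulting dictionary can then be trimmed (session cookies, for instance, usually should not be hard-coded) before you reuse it.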

Where to put them: You can put them in settings.py to apply to every request, or in your spider for specific requests.

Option A: Global Settings (In settings.py)

# settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
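    # Only advertise 'br' if the brotli package is installed; otherwise Scrapy cannot decode Brotli responses
    'Accept-Encoding': 'gzip, deflate, br',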
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}

Option B: Per Spider (In spiders/myspider.py)

# spiders/myspider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,...',
            # ... paste headers here
        }
    }

4. Random Delays (Politeness)

Robots hit pages instantly. Humans take time to read.

Solution: Slow down your spider and make it random.

Where to put it: Open settings.py and add/change these lines:

# settings.py

# Enable Auto-Throttling (Scrapy adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60

# Add a random delay between requests to the same site
# With a delay of 2, Scrapy waits a random 1s to 3s (0.5x to 1.5x the delay)
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
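To see where the "1s to 3s" range comes from, here is a small stand-alone illustration of the documented rule (a random value between 0.5 and 1.5 times DOWNLOAD_DELAY). The function `randomized_delay` is a hypothetical helper written for this example, not part of Scrapy's API.

```python
import random


def randomized_delay(download_delay):
    """Pick a wait time between 0.5x and 1.5x of the configured delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


# Draw a few sample waits for DOWNLOAD_DELAY = 2.
samples = [randomized_delay(2) for _ in range(5)]
print([round(s, 2) for s in samples])  # five values between 1.0 and 3.0
```

A larger DOWNLOAD_DELAY widens the spread as well as the average, so a delay of 4 would give waits between 2s and 6s.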

5. Cookies and Sessions

Some sites track your "session". If you make 100 requests with no cookies (or with the same cookie for too long), it looks suspicious.

Scenario A: Disable Cookies (General Scraping)

If the site tracks users to ban them, disable cookies so every request looks like a new visitor.

In settings.py:

COOKIES_ENABLED = False

Scenario B: Maintain Session (Login/Complex Sites)

If the site requires a session, keep cookies enabled (default) but be careful not to make too many requests from one "user".
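One way to avoid sending everything from a single "user" is to spread requests over several cookie jars using Scrapy's `cookiejar` request meta key, which keeps a separate cookie session per jar id. The sketch below only builds the meta dictionaries, so it runs without Scrapy. The helper name `assign_sessions` and the URLs are hypothetical.

```python
from itertools import cycle


def assign_sessions(urls, session_count):
    """Pair each URL with request meta selecting one of several cookie jars."""
    jar_ids = cycle(range(session_count))
    return [(url, {"cookiejar": jar_id}) for url, jar_id in zip(urls, jar_ids)]


# Made-up URLs for illustration.
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
for url, meta in assign_sessions(urls, session_count=2):
    # In a spider: yield scrapy.Request(url, meta=meta, callback=self.parse)
    print(url, meta)
```

The `cookiejar` key is not carried over automatically, so follow-up requests need to pass it along again, for example with meta={'cookiejar': response.meta['cookiejar']}.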

6. Referer Spoofing

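When you click a link from Google to a site, the Referer header points to "https://www.google.com/". If you land directly on a deep product page with no Referer, it can look like a bot. Scrapy already sets the Referer on requests it follows from a response, so this mainly matters for your start requests.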

Solution: Fake the Referer header.

Where to put it: Inside your spider code.

# spiders/myspider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/product/123",
            headers={'Referer': 'https://www.google.com/'},  # <--- Add this
            callback=self.parse
        )

7. Concurrency Limits

Don't hammer the server.

In settings.py:

CONCURRENT_REQUESTS = 8  # Default is 16, lower is safer
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # Default is 8

Complete Example: Putting It All Together

Here is a complete settings.py file that combines the settings above. Replace myproject with your own project name before you copy-paste it.

# settings.py

BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# 1. Rotate User Agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# 2. Real Browser Headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
}

# 3. Random Delays
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True

# 4. Disable Cookies (Optional, depends on site)
COOKIES_ENABLED = False

# 5. Limit Concurrency
CONCURRENT_REQUESTS = 8

# Respect robots.txt (Good practice, but sometimes you need to disable it)
ROBOTSTXT_OBEY = True

By applying all these settings, you can scrape a surprising number of "protected" sites using just pure Scrapy, saving you the overhead of using a full browser.
