
How to Use Scrapy for Stealthy Web Scraping Without Getting Caught


Before you reach for heavy tools like Playwright or expensive proxies, you can do a lot to avoid detection with pure Scrapy. This guide covers the core techniques for making a standard Scrapy spider look more human.

1. The Golden Rule: Don't Act Like a Robot

Robots are fast, precise, and repetitive. Humans are slow, random, and messy. To avoid detection, your spider must mimic human behavior.


2. User-Agent Rotation (The Basics)

The User-Agent header tells the server what browser you are using. By default, Scrapy identifies itself as something like "Scrapy/2.x (+https://scrapy.org)", which gets you an instant ban on many sites.

Solution: Rotate through a list of real browser User-Agents.

Step-by-Step Implementation:

  1. Install the library: Open your terminal and run:

     pip install scrapy-user-agents
    
  2. Edit settings.py: Open the settings.py file in your project folder. Find the DOWNLOADER_MIDDLEWARES section (or create it if it doesn't exist) and paste this:

     # settings.py
    
     DOWNLOADER_MIDDLEWARES = {
         # Disable the default UserAgent middleware
         'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
         # Enable the random UserAgent middleware
         'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
     }
    

3. Headers: The "Fingerprint" of a Browser

Browsers send a specific set of headers with every request. If you only send a User-Agent, it looks suspicious.

Solution: Copy the full headers from a real browser request.

How to get them:

  1. Open Chrome -> Network Tab.

  2. Refresh the page.

  3. Right-click the main request -> Copy -> Copy as cURL (bash).

  4. Use a tool (like curlconverter.com) to convert it to a Python dictionary.

Where to put them: You can put them in settings.py to apply to every request, or in your spider for specific requests.

Option A: Global Settings (In settings.py)

# settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}

Option B: Per Spider (In spiders/myspider.py)

# spiders/myspider.py

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,...',
            # ... paste headers here
        }
    }

4. Random Delays (Politeness)

Robots hit pages instantly. Humans take time to read.

Solution: Slow down your spider and make it random.

Where to put it: Open settings.py and add/change these lines:

# settings.py

# Enable Auto-Throttling (Scrapy adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60

# Add a random delay between requests
# If set to 2, Scrapy will wait between 1s and 3s randomly
DOWNLOAD_DELAY = 2 
RANDOMIZE_DOWNLOAD_DELAY = True
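The randomization itself is simple: with RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5. A standalone sketch of that behaviour:

```python
import random

DOWNLOAD_DELAY = 2  # seconds, matching the setting above


def randomized_delay(base_delay: float) -> float:
    """Mimic RANDOMIZE_DOWNLOAD_DELAY: base * uniform(0.5, 1.5)."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# With a base of 2s, every wait falls somewhere between 1s and 3s
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(1000)]
print(round(min(samples), 2), round(max(samples), 2))
```

The jitter matters more than the average speed: a fixed 2-second gap between requests is itself a machine-like pattern.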

5. Cookies and Sessions

Some sites track your "session" via cookies. Making hundreds of requests with no cookies at all, or with the same cookie for too long, stands out.

Scenario A: Disable Cookies (General Scraping) If the site tracks users to ban them, disable cookies so every request looks like a new visitor.

In settings.py:

COOKIES_ENABLED = False

Scenario B: Maintain Session (Login/Complex Sites) If the site requires a session, keep cookies enabled (default) but be careful not to make too many requests from one "user".


6. Referer Spoofing

When you click a link from Google to a site, the Referer header says "google.com". If you go directly to a product page with no Referer, it looks like a bot.

Solution: Fake the Referer header.

Where to put it: Inside your spider code.

# spiders/myspider.py

def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/product/123",
        headers={'Referer': 'https://www.google.com/'}, # <--- Add this
        callback=self.parse
    )

7. Concurrency Limits

Don't hammer the server.

In settings.py:

CONCURRENT_REQUESTS = 8  # Default is 16, lower is safer
CONCURRENT_REQUESTS_PER_DOMAIN = 4

Complete Example: Putting It All Together

Here is a complete settings.py file optimized for stealth. You can copy-paste this into your project.

# settings.py

BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# 1. Rotate User Agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# 2. Real Browser Headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
}

# 3. Random Delays
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True

# 4. Disable Cookies (Optional, depends on site)
COOKIES_ENABLED = False

# 5. Limit Concurrency
CONCURRENT_REQUESTS = 8

# Respect robots.txt (Good practice, but sometimes you need to disable it)
ROBOTSTXT_OBEY = True

By applying all these settings, you can scrape a surprising number of "protected" sites using just pure Scrapy, saving you the overhead of using a full browser.

Tech Priya


Tech Priya is a knowledge blog where electronics, Python, and core tech concepts are explained using real-world analogies in Kannada-English, making learning clear, relatable, and enjoyable.