How to Use Scrapy for Stealthy Web Scraping Without Getting Caught
Before you reach for heavy tools like Playwright or expensive proxies, you can do a LOT to avoid detection using pure Scrapy alone. This guide covers the core techniques for making a standard Scrapy spider look more human.
1. The Golden Rule: Don't Act Like a Robot
Robots are fast, precise, and repetitive. Humans are slow, random, and messy. To avoid detection, your spider must mimic human behavior.
2. User-Agent Rotation (The Basics)
The User-Agent header tells the server which browser you are using. By default, Scrapy identifies itself as something like "Scrapy/2.x (+https://scrapy.org)". This is an instant ban on many sites.
Solution: Rotate through a list of real browser User-Agents.
Step-by-Step Implementation:
Install the library: Open your terminal and run:
pip install scrapy-user-agents
Edit settings.py: Open the settings.py file in your project folder. Find the DOWNLOADER_MIDDLEWARES section (or create it if it doesn't exist) and paste this:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the default UserAgent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random UserAgent middleware
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
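If you'd rather not add a dependency, the same idea can be hand-rolled in a few lines. The sketch below is an illustration, not the scrapy-user-agents implementation: the UA strings are examples, and the class goes in your project's middlewares.py and gets registered in DOWNLOADER_MIDDLEWARES the same way.

```python
import random

# A small pool of real browser User-Agent strings (examples; extend as needed)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent the request currently carries
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request
```

The trade-off: the library ships thousands of real UA strings, while a hand-rolled list goes stale; rotating among only three UAs is itself a fingerprint, so keep the pool large and current.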
3. Headers: The "Fingerprint" of a Browser
Browsers send a specific set of headers with every request. If you only send a User-Agent, it looks suspicious.
Solution: Copy the full headers from a real browser request.
How to get them:
Open Chrome DevTools (F12) and go to the Network tab.
Refresh the page.
Right-click the main request -> Copy -> Copy as cURL (bash).
Use a tool (like curlconverter.com) to convert it to a Python dictionary.
Where to put them: You can put them in settings.py to apply to every request, or in your spider for specific requests.
Option A: Global Settings (In settings.py)
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}
Option B: Per Spider (In spiders/myspider.py)
# spiders/myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,...',
            # ... paste headers here
        }
    }
4. Random Delays (Politeness)
Robots hit pages instantly. Humans take time to read.
Solution: Slow down your spider and make it random.
Where to put it: Open settings.py and add/change these lines:
# settings.py
# Enable Auto-Throttling (Scrapy adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60
# Add a random delay between requests
# If set to 2, Scrapy will wait between 1s and 3s randomly
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
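To see where the "between 1s and 3s" claim comes from: with RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5. A quick sketch of that calculation (the function name is made up for illustration):

```python
import random

def effective_delay(download_delay: float) -> float:
    # With RANDOMIZE_DOWNLOAD_DELAY = True, Scrapy waits roughly
    # download_delay * uniform(0.5, 1.5) seconds between requests
    return download_delay * random.uniform(0.5, 1.5)
```

So DOWNLOAD_DELAY = 2 gives waits anywhere between 1 s and 3 s, which is far less mechanical than a fixed interval.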
5. Cookies and Sessions
Some sites track your "session". If you make 100 requests with no cookies (or the same cookie for too long), it looks weird.
Scenario A: Disable Cookies (General Scraping)
If the site tracks users to ban them, disable cookies so every request looks like a new visitor.
In settings.py:
COOKIES_ENABLED = False
Scenario B: Maintain Session (Login/Complex Sites)
If the site requires a session, keep cookies enabled (the default) but be careful not to make too many requests from one "user".
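For Scenario B there is a middle ground: Scrapy's cookies middleware can keep several independent cookie sessions at once via the 'cookiejar' key in a request's meta dict. The sketch below spreads URLs round-robin across a small pool of session ids (the helper name and session count are illustrative, not a Scrapy API):

```python
def assign_cookiejars(urls, num_sessions=3):
    """Round-robin URLs across separate cookie sessions.

    Each returned dict describes one scrapy.Request: passing the meta dict
    makes Scrapy keep a distinct cookiejar per session id, so no single
    "user" accumulates the whole crawl's cookie history.
    """
    return [
        {"url": url, "meta": {"cookiejar": i % num_sessions}}
        for i, url in enumerate(urls)
    ]
```

In the spider you would then yield scrapy.Request(r["url"], meta=r["meta"]) for each entry; requests sharing a cookiejar id behave like one continuous visitor.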
6. Referer Spoofing
When you click a link from Google to a site, the Referer header says "google.com". If you go directly to a product page with no Referer, it looks like a bot.
Solution: Fake the Referer header.
Where to put it: Inside your spider code.
# spiders/myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/product/123",
            headers={'Referer': 'https://www.google.com/'},  # <--- Add this
            callback=self.parse
        )
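You can go a step further and make each request's Referer point at the previously visited page, so a batch of direct requests looks like one continuous browsing session. A sketch (the helper name and Google entry point are illustrative):

```python
def referer_chain(urls, entry_referer="https://www.google.com/"):
    """Yield (url, headers) pairs where each request claims to have been
    reached by clicking a link on the previous page."""
    prev = entry_referer
    for url in urls:
        yield url, {"Referer": prev}
        prev = url
```

Note that for links you follow out of a parsed response, Scrapy's built-in RefererMiddleware already sets the Referer to the originating page; manual spoofing mainly matters for requests you construct directly in start_requests.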
7. Concurrency Limits
Don't hammer the server.
In settings.py:
CONCURRENT_REQUESTS = 8 # Default is 16, lower is safer
CONCURRENT_REQUESTS_PER_DOMAIN = 4
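As a back-of-envelope check on how polite a configuration is: DOWNLOAD_DELAY is enforced per site slot, so per-domain throughput is governed by the delay rather than the concurrency numbers. This hypothetical helper estimates it, ignoring the randomization factor (which averages out to 1.0) and download time:

```python
def approx_requests_per_minute(download_delay: float) -> float:
    # Roughly one request per download_delay seconds hits each domain,
    # regardless of CONCURRENT_REQUESTS, as long as the delay dominates
    return 60.0 / download_delay
```

With DOWNLOAD_DELAY = 2 that is about 30 requests per minute per domain, a pace far closer to a human reader than the defaults.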
Complete Example: Putting It All Together
Here is a complete settings.py file optimized for stealth. You can copy-paste this into your project.
# settings.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# 1. Rotate User Agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
# 2. Real Browser Headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
}
# 3. Random Delays
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
# 4. Disable Cookies (Optional, depends on site)
COOKIES_ENABLED = False
# 5. Limit Concurrency
CONCURRENT_REQUESTS = 8
# Respect robots.txt (Good practice, but sometimes you need to disable it)
ROBOTSTXT_OBEY = True
By applying all these settings, you can scrape a surprising number of "protected" sites using just pure Scrapy, saving you the overhead of using a full browser.