How to Use Scrapy for Stealthy Web Scraping Without Getting Caught
I'm Ravikirana B, an engineer driven by curiosity and clarity. My work sits at the intersection of hardware and software. I specialize in Python programming and electronics, building real-world solutions that don't just work, they make sense. I started 'Tech Priya' with a simple mission: to share the joy of technology. "Priya" means dear or beloved, and this platform is dedicated to everyone who loves to understand the "why" and "how" behind the machines we use every day. What you'll find here:
Electronics Simplified: Complex circuits explained with relatable analogies (think water tanks, gates, and traffic flows).
Python in Practice: Automation ideas, coding insights, and tool development.
Real Reflections: Honest takes on tech, bridging the gap between textbook theory and hands-on reality.
Native Connection: Tech concepts explained with a Kannada-English touch to make learning feel like home.
I believe technology shouldn't be a barrier. Whether you are a student from a small town or a self-learner with big dreams, Tech Priya is here to make the complex simple. Let's keep exploring: clearly, curiously, and together.
Before you reach for heavy tools like Playwright or expensive proxies, you can do a lot to avoid detection with plain Scrapy. This guide covers the main techniques for making a standard Scrapy spider look more human.
1. The Golden Rule: Don't Act Like a Robot
Robots are fast, precise, and repetitive. Humans are slow, random, and messy. To avoid detection, your spider must mimic human behavior.
2. User-Agent Rotation (The Basics)
The User-Agent header tells the server which client is making the request. By default, Scrapy identifies itself as "Scrapy/VERSION (+https://scrapy.org)", and many sites block that string on sight.
Solution: Rotate through a list of real browser User-Agents.
Step-by-Step Implementation:
Install the library: Open your terminal and run:
pip install scrapy-user-agents
Edit settings.py: Open the settings.py file in your project folder. Find the DOWNLOADER_MIDDLEWARES section (or create it if it doesn't exist) and paste this:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the default UserAgent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random UserAgent middleware
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
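If you would rather not add a dependency, the same idea can be sketched as a small downloader middleware of your own. This is a minimal sketch, not how scrapy-user-agents is implemented: the class name and the two User-Agent strings are assumptions for illustration, and the list should be replaced with current strings copied from real browsers.

```python
import random

# Hypothetical pool; replace with current strings copied from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class SimpleRandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent has been set so far.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

To try it, save the class in your project's middlewares.py and register its dotted path (for example 'myproject.middlewares.SimpleRandomUserAgentMiddleware', a hypothetical path) in DOWNLOADER_MIDDLEWARES in place of the scrapy_user_agents entry.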
3. Headers: The "Fingerprint" of a Browser
Browsers send a specific set of headers with every request. If you only send a User-Agent, it looks suspicious.
Solution: Copy the full headers from a real browser request.
How to get them:
Open Chrome DevTools (F12) -> Network tab.
Refresh the page.
Right-click the main request -> Copy -> Copy as cURL (bash).
Use a tool (like curlconverter.com) to convert it to a Python dictionary.
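If you prefer not to paste requests into an online converter, the conversion step can be approximated locally. The sketch below is an assumption-laden helper (the function name headers_from_curl is made up here) that only reads -H/--header flags from a "Copy as cURL (bash)" string; it ignores cookies passed with -b and every other curl option.

```python
import shlex


def headers_from_curl(command: str) -> dict[str, str]:
    """Extract -H/--header values from a bash-style curl command."""
    # Join the backslash-continued lines that Chrome emits into one line.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers = {}
    # Walk over (flag, value) pairs of neighbouring tokens.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


if __name__ == "__main__":
    copied = (
        "curl 'https://example.com/' \\\n"
        "  -H 'Accept: text/html' \\\n"
        "  -H 'Accept-Language: en-US,en;q=0.5'"
    )
    print(headers_from_curl(copied))
```

The resulting dictionary can be pasted into DEFAULT_REQUEST_HEADERS or passed as headers= on a single request.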
Where to put them: You can put them in settings.py to apply to every request, or in your spider for specific requests.
Option A: Global Settings (In settings.py)
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # Only advertise 'br' if the brotli package is installed; otherwise Scrapy cannot decode the response
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}
Option B: Per Spider (In spiders/myspider.py)
# spiders/myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    # custom_settings overrides the project settings for this spider only
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,...',
            # ... paste headers here
        }
    }
4. Random Delays (Politeness)
Robots hit pages instantly. Humans take time to read.
Solution: Slow down your spider and make it random.
Where to put it: Open settings.py and add/change these lines:
# settings.py
# Enable Auto-Throttling (Scrapy adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60
# Add a random delay between requests
# If set to 2, Scrapy will wait between 1s and 3s randomly
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
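To see where the "between 1s and 3s" figure comes from: with RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy is documented to wait a random time between 0.5x and 1.5x of DOWNLOAD_DELAY. The snippet below only illustrates that arithmetic with a made-up helper name; it is not Scrapy's internal code.

```python
import random


def sample_delay(download_delay: float) -> float:
    """Draw one wait time uniformly from 0.5x to 1.5x of the base delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


if __name__ == "__main__":
    # With DOWNLOAD_DELAY = 2, every sample falls between 1.0 and 3.0 seconds.
    print([round(sample_delay(2), 2) for _ in range(5)])
```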
5. Cookies and Sessions
Some sites track your "session". If you make 100 requests with no cookies (or reuse the same cookie for too long), it looks suspicious.
Scenario A: Disable Cookies (General Scraping)
If the site uses cookies to track and ban visitors, disable cookies so every request looks like a new visitor.
In settings.py:
COOKIES_ENABLED = False
Scenario B: Maintain Session (Login/Complex Sites)
If the site requires a session, keep cookies enabled (the default), but be careful not to make too many requests from one "user".
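One way to limit how many requests a single "user" makes is Scrapy's cookiejar key in Request.meta, which keeps a separate cookie session per jar identifier. The helper below is a sketch: the function name and the figure of 20 requests per session are arbitrary assumptions, not recommended values.

```python
def session_meta(request_index: int, requests_per_session: int = 20) -> dict:
    """Build Request.meta that assigns a request to a numbered cookie jar.

    Requests 0..19 share jar 0, requests 20..39 share jar 1, and so on,
    so each jar behaves like a separate visitor with its own cookies.
    """
    return {"cookiejar": request_index // requests_per_session}


# Inside a spider (not executed here):
#   for i, url in enumerate(urls):
#       yield scrapy.Request(url, meta=session_meta(i), callback=self.parse)
```

The cookiejar key is not carried over automatically, so follow-up requests need to pass response.meta["cookiejar"] along in their own meta to stay in the same session.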
6. Referer Spoofing
When you click a link from Google to a site, the Referer header says "google.com". If you go directly to a product page with no Referer, it looks like a bot.
Solution: Fake the Referer header.
Where to put it: Inside your spider code.
# spiders/myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/product/123",
            headers={'Referer': 'https://www.google.com/'},  # <--- Add this
            callback=self.parse,
        )
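Scrapy's built-in RefererMiddleware already fills in the Referer for requests that are generated from a response, so in practice it is mainly the first requests that arrive without one. If you have many start URLs, a small downloader middleware can add a fallback instead of repeating the header in every request. This is a sketch under assumptions: the class name and the fallback URL are illustrative choices.

```python
class DefaultRefererMiddleware:
    """Downloader middleware that adds a fallback Referer when none is set."""

    # Assumed fallback; pick a referrer that is plausible for the target site.
    FALLBACK_REFERER = "https://www.google.com/"

    def process_request(self, request, spider):
        # Leave requests alone if a Referer was already set elsewhere.
        if "Referer" not in request.headers:
            request.headers["Referer"] = self.FALLBACK_REFERER
        # Returning None lets Scrapy continue with the request.
        return None
```

Like any downloader middleware, it only takes effect once its dotted path is registered in DOWNLOADER_MIDDLEWARES.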
7. Concurrency Limits
Don't hammer the server.
In settings.py:
CONCURRENT_REQUESTS = 8 # Default is 16, lower is safer
CONCURRENT_REQUESTS_PER_DOMAIN = 4
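When planning a crawl, it helps to estimate how long these limits make it run. The sketch below assumes a single target domain and that DOWNLOAD_DELAY, not the concurrency limit, is what bounds throughput (requests to one domain are spaced by roughly the delay on average). It ignores response time and any extra slowdown from AutoThrottle, so treat the result as a lower bound; the function name is made up for illustration.

```python
def estimate_crawl_hours(pages: int, download_delay: float) -> float:
    """Rough lower bound on crawl time for one domain, in hours."""
    # One request per `download_delay` seconds on average.
    return pages * download_delay / 3600


if __name__ == "__main__":
    # 1800 pages at a 2-second average delay take about one hour.
    print(estimate_crawl_hours(1800, 2))
```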
Complete Example: Putting It All Together
Here is a complete settings.py file optimized for stealth. You can copy-paste this into your project.
# settings.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# 1. Rotate User Agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
# 2. Real Browser Headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
}
# 3. Random Delays
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
# 4. Disable Cookies (Optional, depends on site)
COOKIES_ENABLED = False
# 5. Limit Concurrency
CONCURRENT_REQUESTS = 8
# Respect robots.txt (Good practice, but sometimes you need to disable it)
ROBOTSTXT_OBEY = True
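To get a feel for what ROBOTSTXT_OBEY = True filters out, you can check a URL against a robots.txt file by hand. The sketch below uses Python's standard urllib.robotparser purely as an illustration; Scrapy uses its own parser internally, so edge cases may be handled differently. The helper name and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_ROBOTS, "myproject", "https://example.com/product/123"))
    print(is_allowed(EXAMPLE_ROBOTS, "myproject", "https://example.com/private/page"))
```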
By applying all these settings, you can scrape a surprising number of "protected" sites using just pure Scrapy, saving you the overhead of using a full browser.