Scrapy Best Practices and Advanced Scenarios Explained
I'm Ravikirana B, an engineer driven by curiosity and clarity. My work sits at the intersection of hardware and software. I specialize in Python programming and electronics, building real-world solutions that don't just work: they make sense. I started 'Tech Priya' with a simple mission: to share the joy of technology. "Priya" means dear or beloved, and this platform is dedicated to everyone who loves to understand the "why" and "how" behind the machines we use every day.

What you'll find here:

- Electronics Simplified: Complex circuits explained with relatable analogies (think water tanks, gates, and traffic flows).
- Python in Practice: Automation ideas, coding insights, and tool development.
- Real Reflections: Honest takes on tech, bridging the gap between textbook theory and hands-on reality.
- Native Connection: Tech concepts explained with a Kannada-English touch to make learning feel like home.

I believe technology shouldn't be a barrier. Whether you are a student from a small town or a self-learner with big dreams, Tech Priya is here to make the complex simple. Let's keep exploring: clearly, curiously, and together.
In this final article, we will cover some advanced Scrapy scenarios and best practices to help you build robust and scalable scrapers.
1. Handling Pagination
Most scraping tasks involve following "Next" buttons to scrape multiple pages.
def parse(self, response):
    # ... extract items ...

    # Find the next page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
response.follow supports relative URLs, so you don't need to construct the full URL manually.
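Under the hood, resolving a relative href boils down to joining it against the current page URL, which you can see with the standard library's urljoin. A minimal sketch (URLs are illustrative, not taken from a live crawl):

```python
from urllib.parse import urljoin

# A rough sketch of what response.follow does with a relative href:
# it resolves the link against the current page URL before scheduling
# the request.
page_url = "https://quotes.toscrape.com/page/1/"
next_href = "/page/2/"  # the relative href the CSS selector returned
absolute_url = urljoin(page_url, next_href)
print(absolute_url)  # https://quotes.toscrape.com/page/2/
```

This is why you can pass the selector's raw `href` value straight to `response.follow` without any manual string concatenation.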
2. Handling Login Forms
To scrape data behind a login, you need to send a POST request with your credentials.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://quotes.toscrape.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'myuser', 'password': 'mypassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login was successful
        if "Logout" in response.text:
            self.logger.info("Login successful")
            # Continue scraping here
        else:
            self.logger.error("Login failed")
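A useful detail: `FormRequest.from_response` pre-populates the request with the fields already present in the page's `<form>` (often hidden ones, such as a CSRF token), and your `formdata` is merged over them. A plain-Python sketch of that merge, with illustrative field values:

```python
# Sketch of the merge FormRequest.from_response performs:
# fields parsed from the <form> (often hidden, e.g. a CSRF token) are kept,
# and your formdata overrides or extends them. Values are illustrative.
form_fields = {"csrf_token": "abc123", "username": ""}  # parsed from the page
formdata = {"username": "myuser", "password": "mypassword"}
payload = {**form_fields, **formdata}
print(payload)
# {'csrf_token': 'abc123', 'username': 'myuser', 'password': 'mypassword'}
```

This is why `from_response` is preferred over building a bare `FormRequest` yourself: you get the hidden token fields for free.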
3. Avoiding Bans
Websites often block scrapers. Here are some tips to avoid getting banned:

- Rotate User Agents: Use the scrapy-user-agents middleware to rotate User-Agent headers.
- Rotate IPs: Use a proxy service and scrapy-rotating-proxies.
- Slow Down: Increase DOWNLOAD_DELAY in settings.py:

  DOWNLOAD_DELAY = 2  # Wait 2 seconds between requests

- Disable Cookies: If cookies are not needed, disable them to prevent tracking:

  COOKIES_ENABLED = False
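The throttling tips above can be combined in settings.py. The sketch below uses real Scrapy setting names, but the values are examples to tune per site, not recommendations:

```python
# settings.py: politeness settings combined (values are examples, tune per site)
DOWNLOAD_DELAY = 2                  # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay to look less mechanical
COOKIES_ENABLED = False             # skip cookies when the site works without them
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server load
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
```

AutoThrottle is often the easiest win: it adjusts the delay dynamically based on server response times instead of forcing you to guess a fixed number.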
4. Storing Data
While JSON/CSV exports are good for small tasks, for larger projects, you should use a database.
Example: Saving to MongoDB
- Install pymongo.
- Create a pipeline in pipelines.py:
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider, named after the spider
        self.db[spider.name].insert_one(dict(item))
        return item
- Add settings to settings.py:
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
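Pipelines also compose: because the number in ITEM_PIPELINES is a priority, you can run a cleaning or deduplication pipeline before the storage one. Below is a plain-Python sketch of such a dedup stage (the class name DedupPipeline and the "text" key are illustrative; in a real project you would raise scrapy.exceptions.DropItem rather than ValueError):

```python
# Hypothetical dedup stage to run before MongoPipeline. In real Scrapy code,
# raise scrapy.exceptions.DropItem instead of ValueError to skip an item.
class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = dict(item).get("text")  # "text" is an example field name
        if key in self.seen:
            raise ValueError(f"duplicate item: {key!r}")
        self.seen.add(key)
        return item
```

To wire it in, you would register it with a lower number than the storage pipeline (e.g. 200 vs. 300) so duplicates are dropped before they reach the database.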
5. Best Practices Checklist
[ ] Respect robots.txt whenever possible.
[ ] Use Items: Define structured Items instead of yielding raw dictionaries.
[ ] Write Tests: Use scrapy.contracts or unit tests for your spiders.
[ ] Monitor: Use logging and tools like Spidermon to monitor your spiders.
[ ] Clean Data: Use pipelines to clean and validate data before storage.
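On the "Use Items" point: besides scrapy.Item subclasses, Scrapy also accepts plain dataclasses as items (handled through itemadapter). A minimal sketch, with field names chosen as examples for a quotes site:

```python
from dataclasses import dataclass, field

# A structured item instead of a raw dict. Scrapy also supports plain
# dataclasses like this one as items; the field names are just examples.
@dataclass
class QuoteItem:
    text: str = ""
    author: str = ""
    tags: list = field(default_factory=list)
```

Structured items catch typos in field names at yield time, which raw dictionaries silently let through.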
Conclusion
You have now covered the journey from installing Scrapy to handling advanced scenarios. Scrapy is a versatile tool, and mastering it will give you the power to access data from all over the web. Happy scraping!