Best Practices and Advanced Situations Explained



In this final article, we will cover some advanced Scrapy scenarios and best practices to help you build robust and scalable scrapers.

1. Handling Pagination

Most scraping tasks involve following "Next" links so the spider can crawl multiple pages.

def parse(self, response):
    # ... extract items ...

    # Find the next page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

response.follow supports relative URLs, so you don't need to construct the full URL manually.

2. Handling Login Forms

To scrape data behind a login, you need to send a POST request with your credentials.

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://quotes.toscrape.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'myuser', 'password': 'mypassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login was successful
        if "Logout" in response.text:
            self.logger.info("Login successful")
            # Continue scraping
        else:
            self.logger.error("Login failed")

3. Avoiding Bans

Websites often block scrapers. Here are some tips to avoid getting banned:

  • Rotate User Agents: Use scrapy-user-agents middleware to rotate User-Agent headers.

  • Rotate IPs: Use a proxy service and scrapy-rotating-proxies.

  • Slow Down: Increase DOWNLOAD_DELAY in settings.py.

      DOWNLOAD_DELAY = 2 # Wait 2 seconds between requests
    
  • Disable Cookies: If not needed, disable cookies to prevent tracking.

      COOKIES_ENABLED = False
    

4. Storing Data

While JSON/CSV exports are good for small tasks, for larger projects, you should use a database.

Example: Saving to MongoDB

  1. Install pymongo (pip install pymongo).

  2. Create a pipeline in pipelines.py:

import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        return item

  3. Add settings to settings.py:

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}

5. Best Practices Checklist

  • [ ] Respect robots.txt: Keep ROBOTSTXT_OBEY = True in settings.py whenever possible.

  • [ ] Use Items: Define structured Items instead of yielding raw dictionaries.

  • [ ] Write Tests: Use scrapy.contracts or unit tests for your spiders.

  • [ ] Monitor: Use logging and tools like Spidermon to monitor your spiders.

  • [ ] Clean Data: Use pipelines to clean and validate data before storage.
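The "Clean Data" point is easiest to apply when the pipeline delegates to a small pure function that can be unit-tested without running a crawl. A minimal sketch — the field names and the clean_item helper are hypothetical, not part of Scrapy:

```python
import re


def clean_item(item: dict) -> dict:
    """Normalise whitespace in the title and convert a price string
    like '£12.50' to a float. Field names here are hypothetical."""
    cleaned = dict(item)
    # Collapse runs of whitespace in the title
    cleaned['title'] = re.sub(r'\s+', ' ', item.get('title', '')).strip()
    # Strip currency symbols and separators before converting to float
    price = re.sub(r'[^\d.]', '', item.get('price', '0'))
    cleaned['price'] = float(price) if price else 0.0
    return cleaned


class CleaningPipeline:
    """Scrapy item pipeline that applies clean_item to every scraped item."""

    def process_item(self, item, spider):
        return clean_item(item)
```

Because clean_item is a plain function, it slots directly into the "Write Tests" point as well: you can assert its behaviour in a normal unit test without starting a spider.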

Conclusion

You have now covered the journey from installing Scrapy to handling advanced scenarios. Scrapy is a versatile tool, and mastering it will give you the power to access data from all over the web. Happy scraping!
