Best Practices and Advanced Situations Explained



In this final article, we will cover some advanced Scrapy scenarios and best practices to help you build robust and scalable scrapers.

1. Handling Pagination

Most scraping tasks involve following "Next" links so the spider can crawl multiple pages.

def parse(self, response):
    # ... extract items ...

    # Find the next page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

response.follow supports relative URLs, so you don't need to construct the full URL manually.

2. Handling Login Forms

To scrape data behind a login, you need to send a POST request with your credentials.

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://quotes.toscrape.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'myuser', 'password': 'mypassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login was successful
        if "Logout" in response.text:
            self.logger.info("Login successful")
            # Continue scraping
        else:
            self.logger.error("Login failed")

3. Avoiding Bans

Websites often block scrapers. Here are some tips to avoid getting banned:

  • Rotate User Agents: Use scrapy-user-agents middleware to rotate User-Agent headers.

  • Rotate IPs: Use a proxy service and scrapy-rotating-proxies.

  • Slow Down: Increase DOWNLOAD_DELAY in settings.py.

      DOWNLOAD_DELAY = 2 # Wait 2 seconds between requests
    
  • Disable Cookies: If not needed, disable cookies to prevent tracking.

      COOKIES_ENABLED = False
    

4. Storing Data

While JSON/CSV exports are good for small tasks, for larger projects, you should use a database.

Example: Saving to MongoDB

  1. Install pymongo (pip install pymongo).

  2. Create a pipeline in pipelines.py:

import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        return item

  3. Add settings to settings.py:

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}

5. Best Practices Checklist

  • [ ] Respect robots.txt: Keep ROBOTSTXT_OBEY = True in settings.py whenever possible.

  • [ ] Use Items: Define structured Items instead of yielding raw dictionaries.

  • [ ] Write Tests: Use scrapy.contracts or unit tests for your spiders.

  • [ ] Monitor: Use logging and tools like Spidermon to monitor your spiders.

  • [ ] Clean Data: Use pipelines to clean and validate data before storage.
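The "Clean Data" point is easiest to apply when the pipeline delegates to a small pure function that can be unit-tested without running a crawl. A minimal sketch — the field names and the clean_item helper are hypothetical, not part of Scrapy:

```python
import re


def clean_item(item: dict) -> dict:
    """Normalise whitespace in the title and convert a price string
    like '£12.50' to a float. Field names here are hypothetical."""
    cleaned = dict(item)
    # Collapse runs of whitespace in the title
    cleaned['title'] = re.sub(r'\s+', ' ', item.get('title', '')).strip()
    # Strip currency symbols and separators before converting to float
    price = re.sub(r'[^\d.]', '', item.get('price', '0'))
    cleaned['price'] = float(price) if price else 0.0
    return cleaned


class CleaningPipeline:
    """Scrapy item pipeline that applies clean_item to every scraped item."""

    def process_item(self, item, spider):
        return clean_item(item)
```

Because clean_item is a plain function, it slots directly into the "Write Tests" point as well: you can assert its behaviour in a normal unit test without starting a spider.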

Conclusion

You have now covered the journey from installing Scrapy to handling advanced scenarios. Scrapy is a versatile tool, and mastering it will give you the power to access data from all over the web. Happy scraping!
