
How to Effectively Debug Scrapy Spiders

By Ravikirana B (Tech Priya)

Debugging asynchronous code can be challenging. Since Scrapy is based on Twisted, standard debugging techniques might not always work as expected. However, Scrapy provides several powerful tools to help you debug your spiders.

1. The Scrapy Shell

The Scrapy shell is your best friend. It allows you to test your extraction code without running the full spider.

Usage:

scrapy shell "https://quotes.toscrape.com"

Inside the shell, you can try out your CSS or XPath selectors:

>>> response.css("div.quote")
[...]
>>> response.css("div.quote span.text::text").get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Tip: You can also open the shell from within your spider code using scrapy.shell.inspect_response:

def parse(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
    # ... rest of your code

When the spider hits this line, it will pause and open a shell in your terminal, allowing you to inspect the response object right there.

2. Logging

Scrapy has a robust logging system. You can use it to track the flow of your spider and spot errors.

import logging


class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        self.logger.info("Visited %s", response.url)
        if "error" in response.text:
            self.logger.error("Error found on page: %s", response.url)

Check the console output for INFO, WARNING, and ERROR logs.
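You can also tune how much gets logged and where it goes. A minimal sketch for settings.py, using Scrapy's standard LOG_LEVEL and LOG_FILE settings:

```python
# settings.py
LOG_LEVEL = "WARNING"    # suppress DEBUG/INFO noise; show only warnings and errors
LOG_FILE = "spider.log"  # write logs to a file instead of the console
```

Dropping the level to WARNING is especially useful on large crawls, where the default DEBUG output can bury the one error line you actually care about.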

3. Parse Command

The parse command allows you to verify your spider method against a specific URL.

scrapy parse --spider=quotes --callback=parse --depth=1 "https://quotes.toscrape.com"

This will run the parse method of the quotes spider on the given URL and show you the extracted items.

4. Common Issues and Fixes

4.1. Empty Output

  • Check your selectors: Use scrapy shell to verify them.

  • Check for JavaScript: If the data is missing from the raw HTML (view-source:) but visible in "Inspect Element", the site renders it with JavaScript. You will need a browser-based tool such as Selenium or Playwright.

  • Check robots.txt: Scrapy respects robots.txt by default. Set ROBOTSTXT_OBEY = False in settings.py to ignore it (be careful with this).

4.2. 403 Forbidden

  • User-Agent: Many sites block the default Scrapy User-Agent. Change it in settings.py:

      USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    

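Beyond the User-Agent, sending a fuller set of browser-like headers can also help get past simple blocks. DEFAULT_REQUEST_HEADERS is the standard Scrapy setting for this; the values below are an illustrative example, not a requirement:

```python
# settings.py -- sent with every request unless overridden per-request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```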
4.3. Missing Items

  • Asynchronous Loading: The data might be loaded via a separate API call. Check the "Network" tab in your browser's developer tools.
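Once you spot such an endpoint in the Network tab, requesting it directly is usually simpler than scraping the rendered HTML: in a spider you would yield a Request to that URL and parse the body with response.json() (available on Scrapy's TextResponse since 2.2). The parsing step itself is plain dictionary work, sketched here with the standard library; the payload shape and field names ("quotes", "text", "author") are assumptions for illustration:

```python
import json

# Illustrative payload, shaped like a typical paginated JSON API response.
payload = '{"quotes": [{"text": "Hello", "author": "A"}], "has_next": false}'

data = json.loads(payload)
items = [{"text": q["text"], "author": q["author"]} for q in data["quotes"]]
print(items)
```

In the spider, `data` would simply be `response.json()`, and each dict could be yielded as an item.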

5. Using a Debugger (PDB)

You can use Python's built-in debugger pdb.

import pdb


def parse(self, response):
    pdb.set_trace()
    # ...

When the spider reaches this line, it will pause, and you can inspect variables. Note that this blocks the entire reactor, so all other requests will pause too.
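On Python 3.7+ the built-in breakpoint() does the same job without an explicit import:

```python
def parse(self, response):
    breakpoint()  # drops into pdb (or the debugger named by PYTHONBREAKPOINT)
    # ... rest of your code
```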

Next Steps

In the next article, we will cover advanced scenarios and best practices.
