How to Effectively Debug Scrapy Spiders
Debugging asynchronous code can be challenging. Since Scrapy is based on Twisted, standard debugging techniques might not always work as expected. However, Scrapy provides several powerful tools to help you debug your spiders.
1. The Scrapy Shell
The Scrapy shell is your best friend. It allows you to test your extraction code without running the full spider.
Usage:
scrapy shell "https://quotes.toscrape.com"
Inside the shell, you can try out your CSS or XPath selectors:
>>> response.css("div.quote")
[...]
>>> response.css("div.quote span.text::text").get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Tip: You can also open the shell from within your spider code using scrapy.shell.inspect_response:
def parse(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
    # ... rest of your code
When the spider hits this line, it will pause and open a shell in your terminal, allowing you to inspect the response object right there.
2. Logging
Scrapy has a robust logging system. You can use it to track the flow of your spider and spot errors.
import logging

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        self.logger.info("Visited %s", response.url)
        if "error" in response.text:
            self.logger.error("Error found on page: %s", response.url)
Check the console output for INFO, WARNING, and ERROR logs.
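Beyond per-message logging, you can control verbosity and destination from settings.py. A minimal sketch of two relevant settings (the log file name is just an example):

```python
# settings.py — a sketch of Scrapy's logging-related settings.
LOG_LEVEL = "DEBUG"      # minimum level to emit (DEBUG is the default)
LOG_FILE = "spider.log"  # write logs to this file instead of the console
```

With LOG_FILE set, the console stays quiet and the full log ends up in the file, which is handy for long crawls.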
3. Parse Command
The parse command allows you to verify your spider method against a specific URL.
scrapy parse --spider=quotes --callback=parse --depth=1 "https://quotes.toscrape.com"
This will run the parse method of the quotes spider on the given URL and show you the extracted items.
4. Common Issues and Fixes
4.1. Empty Output
- Check your selectors: Use scrapy shell to verify them.
- Check for JavaScript: If the data is missing in view-source: but present in "Inspect Element", the site renders the data with JavaScript. You need Selenium or Playwright.
- Check robots.txt: Scrapy respects robots.txt by default. Set ROBOTSTXT_OBEY = False in settings.py to ignore it (be careful with this).
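A quick way to confirm the JavaScript case is to fetch the raw HTML (what view-source shows) and search it for text you expect; if it is absent, the data is injected client-side. A standard-library sketch, where the URL and search string in the usage example are just illustrations:

```python
from urllib.request import Request, urlopen

def appears_in_html(html: str, needle: str) -> bool:
    """True if `needle` occurs in the given server-rendered HTML."""
    return needle in html

def fetch_raw_html(url: str) -> str:
    """Fetch HTML the way view-source does: no JavaScript execution."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

# Usage (requires network access):
#     html = fetch_raw_html("https://quotes.toscrape.com")
#     appears_in_html(html, "world as we have created")
```

If the check returns False while the browser shows the data, reach for Selenium or Playwright (or look for the underlying API, as described in 4.3).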
4.2. 403 Forbidden
- User-Agent: Many sites block the default Scrapy User-Agent. Change it in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
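If changing the User-Agent alone is not enough, some sites also inspect other headers. Scrapy's DEFAULT_REQUEST_HEADERS setting applies headers to every request; a sketch, where the header values are only examples to adapt per site:

```python
# settings.py — a sketch; the header values are examples, adjust per site.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
```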
4.3. Missing Items
- Asynchronous Loading: The data might be loaded via a separate API call. Check the "Network" tab in your browser's developer tools.
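When the Network tab reveals a JSON endpoint, you can often skip the HTML entirely and parse the API response. A minimal sketch — the payload shape (a "quotes" list) is hypothetical, and inside a real spider you would feed it response.json() (available since Scrapy 2.2):

```python
import json

def items_from_payload(payload: dict) -> list:
    """Turn a (hypothetical) JSON API payload into item dicts."""
    return [
        {"text": q["text"], "author": q["author"]}
        for q in payload.get("quotes", [])
    ]

# Inside a Scrapy spider the callback would look like:
#     def parse(self, response):
#         yield from items_from_payload(response.json())

sample = json.loads('{"quotes": [{"text": "Hi", "author": "Me"}]}')
print(items_from_payload(sample))  # → [{'text': 'Hi', 'author': 'Me'}]
```

Keeping the payload-to-item logic in a plain function like this also makes it easy to unit-test without running a crawl.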
5. Using a Debugger (PDB)
You can use Python's built-in debugger pdb.
import pdb

def parse(self, response):
    pdb.set_trace()
    # ...
When the spider reaches this line, it will pause, and you can inspect variables. Note that this blocks the entire reactor, so all other requests will pause too.
Next Steps
In the next article, we will cover advanced scenarios and best practices.