
How to Master CSS Selectors and Advanced Debugging Techniques

By Ravikirana B · Tech Priya

In this article, we will dive deeper into how to effectively select data, debug complex issues, and manage logs to speed up your Scrapy development.

1. Mastering Selectors

Finding the right selector is the core of web scraping. Scrapy supports both CSS and XPath selectors.

How to Find Selectors

  1. Browser Developer Tools:

    • Right-click on the element you want to scrape and select "Inspect".

    • In the Elements panel, you can see the HTML structure.

    • Tip: Right-click the element in the HTML view -> Copy -> Copy selector (or Copy XPath). Note: browser-generated selectors encode the exact DOM path, so they break easily when the page layout changes; it's usually better to write your own shorter, class-based selectors.

  2. Scrapy Shell (The Best Way): Always test your selectors in the shell before putting them in your spider.

     scrapy shell "https://quotes.toscrape.com"
    

CSS vs. XPath

  • CSS: Easier to read and write. Good for simple selection by class or ID.

      response.css('div.quote span.text::text').get()
    
  • XPath: More powerful. Can traverse up the DOM (parents), select by text content, and use complex logic.

      response.xpath('//div[@class="quote"]/span[@class="text"]/text()').get()
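If you want to experiment with selector paths offline, Python's standard-library xml.etree.ElementTree module understands a small XPath subset (paths and attribute predicates, but no functions such as contains()). A minimal sketch, using a made-up HTML fragment rather than a real response:

```python
import xml.etree.ElementTree as ET

# A made-up fragment that mimics the markup on quotes.toscrape.com.
html = (
    '<html><body>'
    '<div class="quote"><span class="text">Quality over quantity.</span></div>'
    '</body></html>'
)

root = ET.fromstring(html)

# Same path as the XPath example above, minus the final text() step,
# which ElementTree's limited XPath subset does not support:
span = root.find(".//div[@class='quote']/span[@class='text']")
print(span.text)  # Quality over quantity.
```

In a real spider you would use response.xpath(...), which is backed by lxml and supports the full XPath 1.0 language.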
    

Advanced Selection Techniques

  • Contains Text (XPath): Select elements that contain specific text. Note: contains(text(), "Next") tests only the element's first text node; contains(., "Next") matches the element's full text content and is often more robust.

      response.xpath('//a[contains(text(), "Next")]/@href').get()
    
  • Siblings: Select the element next to a label.

      # <label>Price:</label> <span>$10</span>
      response.xpath('//label[text()="Price:"]/following-sibling::span/text()').get()
    
  • Attributes: Extracting links or image sources.

      response.css('a::attr(href)').get()
      response.xpath('//img/@src').get()
    
  • Regular Expressions: Extract specific patterns from text.

      # Text: "Price: $10.50" -> Extract "10.50"
      response.css('p.price::text').re_first(r'\$(\d+\.\d+)')
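.re_first() behaves like re.search on the extracted text, so a pattern can be validated with the stdlib re module before it goes into a spider (the sample string below is made up):

```python
import re

# Sample text, as it might come back from response.css('p.price::text').get()
text = "Price: $10.50"

# Same pattern as the .re_first() example: capture the digits after '$'.
match = re.search(r'\$(\d+\.\d+)', text)
price = match.group(1) if match else None
print(price)  # 10.50
```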
    

2. Advanced Debugging Techniques

Is the Request Reaching the Page?

Sometimes your spider runs but returns nothing. Here is how to diagnose:

  1. Check the Status Code: In your logs, look for the status code of the response.

    • 200: OK. The page loaded.

    • 301/302: Redirect. Scrapy follows these by default.

    • 403: Forbidden. You are likely blocked (User-Agent or IP ban).

    • 404: Not Found. URL is wrong.

    • 500: Server Error.

  2. Inspect the Response Body: Sometimes the server returns a 200 OK, but the content is a "Please enable JavaScript" message or a CAPTCHA.

    Use open_in_browser to see exactly what Scrapy sees:

     from scrapy.utils.response import open_in_browser
    
     def parse(self, response):
         open_in_browser(response)
         # ...
    

    This will save the raw HTML response to a temporary file and open it in your default web browser.
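Those status codes can also be turned into readable log messages with a small helper; diagnose() below is a hypothetical name used for illustration, not part of Scrapy's API:

```python
def diagnose(status: int) -> str:
    """Map an HTTP status code to a short scraping-oriented diagnosis."""
    if status == 200:
        return "OK: page loaded"
    if status in (301, 302):
        return "Redirect: Scrapy follows these by default"
    if status == 403:
        return "Forbidden: likely blocked (User-Agent or IP ban)"
    if status == 404:
        return "Not Found: check the URL"
    if status >= 500:
        return "Server error: retry later"
    return "Unexpected status"

print(diagnose(403))  # Forbidden: likely blocked (User-Agent or IP ban)
```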

Debugging the Data Flow

If you are not getting the data you expect:

  1. scrapy.shell.inspect_response: Pause the spider and inspect the response in the shell.

     def parse(self, response):
         if not response.css('.product-list'):
             from scrapy.shell import inspect_response
             inspect_response(response, self)
    
  2. Check for Dynamic Content: If response.body (viewed via open_in_browser) is different from what you see in Chrome, the content is likely loaded via JavaScript. You need Selenium or Playwright.
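A quick heuristic for catching the "200 OK but empty content" case is to scan the body for telltale placeholder text. The marker strings here are illustrative guesses; adjust them per site:

```python
def looks_js_rendered(body_text: str) -> bool:
    """Guess whether a page is a JavaScript placeholder or CAPTCHA wall."""
    markers = ("enable javascript", "captcha", "checking your browser")
    lowered = body_text.lower()
    return any(marker in lowered for marker in markers)

print(looks_js_rendered("<p>Please enable JavaScript to view this site.</p>"))  # True
print(looks_js_rendered("<div class='quote'>Quality over quantity.</div>"))     # False
```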

3. Managing Logs

Scrapy logs can be overwhelming. Here is how to tame them.

Filtering Logs

In settings.py, you can control the log level:

# Options: CRITICAL, ERROR, WARNING, INFO, DEBUG
LOG_LEVEL = 'INFO'

  • DEBUG: Very verbose. Shows every request and response.

  • INFO: Shows opened spiders, scraped items, and errors.

  • WARNING: Only warnings and errors.
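Scrapy's spider loggers are ordinary Python logging loggers, so the effect of each level can be seen with the stdlib alone. This sketch captures log output in memory purely to show what LOG_LEVEL = 'INFO' would let through:

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)

logger = logging.getLogger("demo_spider")
logger.setLevel(logging.INFO)   # analogous to LOG_LEVEL = 'INFO'
logger.addHandler(handler)

logger.debug("GET https://example.com/page/2")   # suppressed at INFO
logger.info("Scraped 10 items")                  # shown
logger.warning("No items found on page 3")       # shown

print(stream.getvalue())
```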

Custom Logging

You can log specific events in your spider to trace execution without the noise.

def parse(self, response):
    self.logger.info(f"Processing page: {response.url}")
    items = response.css('.item')
    if not items:
        self.logger.warning(f"No items found on {response.url}")

Saving Logs to a File

Instead of printing to the console, save logs to a file for later analysis.

scrapy crawl myspider --logfile=spider.log

Or in settings.py:

LOG_FILE = 'spider.log'

Conclusion

By mastering selectors, using advanced debugging tools like open_in_browser, and managing your logs effectively, you can become a highly efficient Scrapy developer.
