Beginner's Guide to Mastering CSS and XPath Selectors


Web scraping is all about selecting the right data. If you can't select it, you can't scrape it. In this guide, we will break down CSS and XPath selectors from the very basics to advanced filtering, so even if you've never used them before, you'll be a pro by the end.

1. What are Selectors?

Imagine a webpage is like a library.

  • HTML is the building.

  • Elements (like <div>, <a>, <p>) are the books.

  • Selectors are the instructions to find a specific book (e.g., "Go to the 3rd shelf, 2nd book from the left").

Scrapy uses two types of selectors:

  1. CSS Selectors: Easy to read, similar to how you style websites.

  2. XPath Selectors: More powerful, and able to express complex logic.


2. CSS Selectors: The Basics

CSS selectors are great for simple tasks.

Selecting by Tag

To select all paragraphs <p>:

response.css('p')

Selecting by Class (.)

To select elements with class="price":

response.css('.price')

Example HTML: <div class="price">100</div>

Selecting by ID (#)

To select an element with id="main-title":

response.css('#main-title')

Example HTML: <h1 id="main-title">Welcome</h1>

Combining Them

To select a div that has the class quote:

response.css('div.quote')

Nested Selection (Descendants)

To select a span inside a div with class quote:

response.css('div.quote span')

3. XPath Selectors: The Powerhouse

XPath looks a bit like a file path on your computer.

Selecting by Tag

To select all div elements:

response.xpath('//div')
  • // means "search anywhere in the document".

  • / means "direct child" (must be immediately inside).

Selecting by Attribute

To select a div with class="quote":

response.xpath('//div[@class="quote"]')
  • @ is used for attributes (class, id, href, src, etc.).

Selecting by Text

This is where XPath shines. To select a button that says "Next Page":

response.xpath('//button[text()="Next Page"]')

Contains (Partial Match)

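If the class attribute is "product-item active" and you just want to match product-item, use contains(). Keep in mind that this is a substring test, so it also matches a class such as product-item-old: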

response.xpath('//div[contains(@class, "product-item")]')

Or matching text that contains "Price":

response.xpath('//span[contains(text(), "Price")]')

4. Extracting Data: Getting the Good Stuff

Once you've selected the element, you need to extract the data (text, link, etc.).

Getting Text

CSS:

response.css('span.text::text').get()

XPath:

response.xpath('//span[@class="text"]/text()').get()

Getting Attributes

To get the URL from <a href="https://example.com">:

CSS:

response.css('a::attr(href)').get()

XPath:

response.xpath('//a/@href').get()

get() vs getall()

  • get(): Returns the first match as a string, or None if nothing matches.

  • getall(): Returns all matches as a list of strings (an empty list if nothing matches).

# Get all quotes on the page
quotes = response.css('div.quote span.text::text').getall()

5. Advanced Filtering and Logic

Sometimes simple selection isn't enough.

"OR" Logic

Select h1 OR h2 tags:

response.xpath('//h1 | //h2')

"AND" Logic

Select a div that has BOTH class="item" AND data-id="123":

response.xpath('//div[@class="item" and @data-id="123"]')

Selecting Based on Position

Select the first item in a list:

response.xpath('//ul/li[1]')

Select the last item:

response.xpath('//ul/li[last()]')

Selecting Siblings (Neighbors)

Imagine this HTML:

<div class="label">Price:</div>
<div class="value">$50</div>

You want the price, but it has no unique class. You can find the "Price:" label and get the next element.

response.xpath('//div[text()="Price:"]/following-sibling::div[1]/text()').get()

Selecting Parent

You found a "Buy Now" button and want to get the product title, which is in a parent container.

response.xpath('//button[@class="buy-now"]/../h2/text()').get()
  • .. moves up to the parent.

6. Real-World Cheat Sheet

Goal          | CSS Example   | XPath Example
--------------|---------------|----------------------------------
Get ID        | #header       | //*[@id="header"]
Get Class     | .item         | //*[@class="item"]
Get Attribute | a::attr(href) | //a/@href
Get Text      | p::text       | //p/text()
Contains Text | Not supported | //div[contains(text(), "Hello")]
Parent        | Not supported | //div/..
Next Sibling  | div + span    | //div/following-sibling::span[1]

7. How to Practice

  1. Open any website (e.g., quotes.toscrape.com).

  2. Open your terminal and run: scrapy shell "https://quotes.toscrape.com"

  3. Try typing these commands:

     >>> response.css('title::text').get()
     'Quotes to Scrape'
     >>> response.xpath('//span[@class="text"]/text()').get()
     '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    

Conclusion

CSS is great for quick, readable selectors. XPath is essential for complex navigation (parents, siblings, text matching). Mastering both gives you the superpower to scrape almost any website!
