Beginner's Guide to Mastering CSS and XPath Selectors
Web scraping is all about selecting the right data. If you can't select it, you can't scrape it. In this guide, we will break down CSS and XPath selectors from the very basics to advanced filtering, so even if you've never used them before, you'll be a pro by the end.
1. What are Selectors?
Imagine a webpage is like a library.
HTML is the building.
Elements (like <div>, <a>, <p>) are the books.
Selectors are the instructions to find a specific book (e.g., "Go to the 3rd shelf, 2nd book from the left").
Scrapy uses two types of selectors:
CSS Selectors: Easy to read, similar to how you style websites.
XPath Selectors: More powerful, allows complex logic.
2. CSS Selectors: The Basics
CSS selectors are great for simple tasks.
Selecting by Tag
To select all paragraphs <p>:
response.css('p')
Selecting by Class (.)
To select elements with class="price":
response.css('.price')
Example HTML: <div class="price">100</div>
Selecting by ID (#)
To select an element with id="main-title":
response.css('#main-title')
Example HTML: <h1 id="main-title">Welcome</h1>
Combining Them
To select a div that has the class quote:
response.css('div.quote')
Nested Selection (Descendants)
To select a span inside a div with class quote:
response.css('div.quote span')
3. XPath Selectors: The Powerhouse
XPath looks a bit like a file path on your computer.
Selecting by Tag
To select all div elements:
response.xpath('//div')
// means "search anywhere in the document".
/ means "direct child" (must be immediately inside).
Selecting by Attribute
To select a div with class="quote":
response.xpath('//div[@class="quote"]')
@ is used for attributes (class, id, href, src, etc.).
Selecting by Text
This is where XPath shines. To select a button that says "Next Page":
response.xpath('//button[text()="Next Page"]')
Contains (Partial Match)
If the class is product-item active and you just want to match product-item:
response.xpath('//div[contains(@class, "product-item")]')
Or matching text that contains "Price":
response.xpath('//span[contains(text(), "Price")]')
4. Extracting Data: Getting the Good Stuff
Once you've selected the element, you need to extract the data (text, link, etc.).
Getting Text
CSS:
response.css('span.text::text').get()
XPath:
response.xpath('//span[@class="text"]/text()').get()
Getting Attributes (Links, Images)
To get the URL from <a href="https://example.com">:
CSS:
response.css('a::attr(href)').get()
XPath:
response.xpath('//a/@href').get()
get() vs getall()
get(): Returns the first match as a string.
getall(): Returns all matches as a list of strings.
# Get all quotes on the page
quotes = response.css('div.quote span.text::text').getall()
5. Advanced Filtering and Logic
Sometimes simple selection isn't enough.
"OR" Logic
Select h1 OR h2 tags:
response.xpath('//h1 | //h2')
"AND" Logic
Select a div that has BOTH class="item" AND data-id="123":
response.xpath('//div[@class="item" and @data-id="123"]')
Selecting Based on Position
Select the first item in a list:
response.xpath('//ul/li[1]')
Select the last item:
response.xpath('//ul/li[last()]')
Selecting Siblings (Neighbors)
Imagine this HTML:
<div class="label">Price:</div>
<div class="value">$50</div>
You want the price, but it has no unique class. You can find the "Price:" label and get the next element.
response.xpath('//div[text()="Price:"]/following-sibling::div[1]/text()').get()
Selecting Parent
You found a "Buy Now" button and want to get the product title, which is in a parent container.
response.xpath('//button[@class="buy-now"]/../h2/text()').get()
.. moves up to the parent.
6. Real-World Cheat Sheet
| Goal | CSS Example | XPath Example |
| --- | --- | --- |
| Get ID | #header | //*[@id="header"] |
| Get Class | .item | //*[@class="item"] |
| Get Attribute | a::attr(href) | //a/@href |
| Get Text | p::text | //p/text() |
| Contains Text | Not supported | //div[contains(text(), "Hello")] |
| Parent | Not supported | //div/.. |
| Next Sibling | div + span | //div/following-sibling::span[1] |
7. How to Practice
Open any website (e.g., quotes.toscrape.com).
Open your terminal and run:
scrapy shell "https://quotes.toscrape.com"
Try typing these commands:
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.xpath('//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Conclusion
CSS is great for speed and simplicity. XPath is essential for complex navigation (parents, siblings, text matching). Mastering both gives you the superpower to scrape almost any website!