How to Set Up a Scrapy Project: A Beginner's Guide
Creating a New Scrapy Project
Once Scrapy is installed, the first step is to set up a new project. Navigate to the directory where you want to store your code and run:
scrapy startproject myproject
This will create a myproject directory with the following structure:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Understanding the Project Structure
scrapy.cfg: The project configuration file. It defines the project settings module.
items.py: Defines the data structures (containers) for the scraped data, similar to Django models.
middlewares.py: Hooks to process requests and responses globally.
pipelines.py: Processes the scraped items (e.g., cleaning data, saving to a database).
settings.py: Contains project settings like user agent, download delay, and enabled pipelines.
spiders/: This is where your "spiders" (the classes that define how to scrape a site) will live.
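To make the role of pipelines.py concrete, here is a minimal, hypothetical pipeline that strips whitespace from a scraped field. The class and field names are illustrative, not part of the generated project:

```python
# pipelines.py -- a minimal, hypothetical item pipeline.
# Scrapy calls process_item() once for every item a spider yields;
# returning the item passes it along to the next enabled pipeline.
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Scraped items behave like dicts here; clean one field if present.
        if item.get("text"):
            item["text"] = item["text"].strip()
        return item
```

Note that a pipeline only runs after you enable it in settings.py via the ITEM_PIPELINES setting.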
Basic Scrapy Commands
Scrapy provides a command-line tool to control your project. Here are some common commands:
scrapy shell [url]: Opens an interactive shell to try out selectors and debug.
scrapy crawl [spider_name]: Runs a spider.
scrapy genspider [name] [domain]: Generates a new spider file.
Your First Spider
Let's create a simple spider to scrape quotes from quotes.toscrape.com.
Navigate into your project:
cd myproject
Generate a spider:
scrapy genspider quotes quotes.toscrape.com
This creates myproject/spiders/quotes.py. Let's edit it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
Running the Spider
To run the spider and save the output to a JSON file:
scrapy crawl quotes -O quotes.json
This command runs the quotes spider and outputs the results to quotes.json. The -O flag overwrites the file on each run; use lowercase -o instead if you want to append to an existing file.
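The resulting file is a single JSON array, so it can be processed with Python's standard json module. A sketch using an inline sample in place of the real quotes.json (the values are invented):

```python
import json

# Shape of the data quotes.json will contain (sample values, not real output).
sample = '[{"text": "An example quote.", "author": "Jane Doe"}]'

quotes = json.loads(sample)
for q in quotes:
    print(f'{q["author"]}: {q["text"]}')
```

To read the actual file, replace the sample with `json.load(open("quotes.json"))`.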
Next Steps
In the next article, we will compare Scrapy with other tools like Selenium and Playwright to understand when to use which.