# Sitemap Meta Scraper
A Python script that extracts meta titles and descriptions from all URLs in a sitemap and saves them to a CSV file.
## Features
- 🌐 Fetches and parses XML sitemaps
- 📊 Extracts meta titles and descriptions from each URL
- 💾 Outputs data to a CSV file with proper formatting
- 📝 Writes data to CSV as it processes each URL (not just at the end)
- 🚀 Shows progress with a progress bar
- 🎉 Logs operations with fun emojis
## Requirements
This script uses Nix to manage dependencies. The following Python packages are used:
- requests
- beautifulsoup4
- lxml
- tqdm
- colorama
## Usage
Run the script from the command line with a sitemap URL and optional output filename:
```sh
./scraper.py SITEMAP_URL [OUTPUT_FILE]
```
### Examples
```sh
# Basic usage with default output filename (meta_data.csv)
./scraper.py https://example.com/sitemap.xml

# Specify a custom output filename
./scraper.py https://example.com/sitemap.xml my_data.csv
```
## Output Format
The script generates a CSV file with the following columns:
- URL
- Meta Title
- Meta Description
Values are enclosed in quotes when necessary, following CSV best practices.
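As an illustration of this quoting behavior, Python's built-in `csv` module with its default dialect quotes a field only when it contains a delimiter, quote character, or line break (a minimal sketch, not the script's exact code):

```python
import csv
import io

# csv.writer's default QUOTE_MINIMAL dialect quotes a field only when
# it contains the delimiter, the quote character, or a newline.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["URL", "Meta Title", "Meta Description"])
writer.writerow([
    "https://example.com/page",
    "Plain title",                        # no quoting needed
    "Contains a comma, so it is quoted",  # quoted automatically
])
print(buf.getvalue())
```

The second description field comes out as `"Contains a comma, so it is quoted"`, while the plain title is written without quotes.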
## How It Works
- The script fetches the provided sitemap URL
- It parses the XML to extract all page URLs
- For each URL found in the sitemap:
  - The page is downloaded
  - The meta title is extracted from the `<title>` tag
  - The meta description is extracted from the `<meta name="description">` tag
- All data is written to a CSV file
- Progress is displayed with a progress bar and emoji-decorated log messages
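The parsing steps above can be sketched with standard-library code alone (the actual script uses `requests`, `beautifulsoup4`, and `lxml`; the helper names here are illustrative, not the script's API):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sitemaps declare this namespace, so <loc> tags must be matched with it.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list[str]:
    """Parse sitemap XML and return every <loc> value."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name="description"> content."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_meta(html: str) -> tuple[str, str]:
    """Return (meta title, meta description) for one downloaded page."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.title.strip(), parser.description.strip()
```

Feeding `extract_urls` a sitemap string yields the page URLs, and `extract_meta` applied to each downloaded page yields the two values written to the CSV row.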
## Performance Notes
- The script includes a small delay between requests to avoid overwhelming servers
- If a page can't be fetched or parsed, the script continues with empty values for that URL
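Both behaviors can be sketched as follows; `fetch_fn` stands in for the real HTTP call, and the delay value is an assumption rather than the script's actual setting:

```python
import time

def fetch_with_fallback(fetch_fn, url: str, delay: float = 0.5) -> str:
    """Pause briefly before each request, then fetch the page.

    On any fetch error, return an empty string so the caller can still
    write a CSV row with empty title/description values and move on.
    The 0.5s default delay is an illustrative assumption.
    """
    time.sleep(delay)  # small delay to avoid overwhelming the server
    try:
        return fetch_fn(url)
    except Exception:
        return ""  # continue with empty values for this URL
```

A failing fetch therefore never aborts the run; the affected URL simply ends up in the CSV with blank title and description columns.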