
Sitemap Meta Scraper

A Python script that extracts meta titles and descriptions from all URLs in a sitemap and saves them to a CSV file.

Features

  • 🌐 Fetches and parses XML sitemaps
  • 📊 Extracts meta titles and descriptions from each URL
  • 💾 Outputs data to a CSV file with proper formatting
  • 📝 Writes data to CSV as it processes each URL (not just at the end)
  • 🚀 Shows progress with a progress bar
  • 🎉 Logs operations with fun emojis

Requirements

This script uses Nix to manage dependencies. The following Python packages are used:

  • requests
  • beautifulsoup4
  • lxml
  • tqdm
  • colorama
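
One way to provide these dependencies is an ad-hoc nix-shell invocation like the one below. This is a sketch only: the repository may instead ship its own shell.nix or use a nix-shell shebang inside scraper.py, so treat the exact command as an assumption.

```shell
# Hypothetical invocation; assumes the script is run from the repo root.
nix-shell -p 'python3.withPackages (ps: [ ps.requests ps.beautifulsoup4 ps.lxml ps.tqdm ps.colorama ])' \
  --run './scraper.py https://example.com/sitemap.xml'
```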

Usage

Run the script from the command line with a sitemap URL and optional output filename:

./scraper.py SITEMAP_URL [OUTPUT_FILE]

Examples

# Basic usage with default output filename (meta_data.csv)
./scraper.py https://example.com/sitemap.xml

# Specify a custom output filename
./scraper.py https://example.com/sitemap.xml my_data.csv
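The argument handling implied by the usage line can be sketched as follows. The helper name `parse_args` is hypothetical; the actual scraper.py may handle its arguments differently (e.g. with argparse).

```python
import sys

# Hypothetical argument handling matching the usage line above;
# scraper.py's actual implementation may differ.
def parse_args(argv):
    """Return (sitemap_url, output_file), defaulting the output filename."""
    if not argv:
        sys.exit("Usage: ./scraper.py SITEMAP_URL [OUTPUT_FILE]")
    sitemap_url = argv[0]
    # Fall back to the default output filename when none is given
    output_file = argv[1] if len(argv) > 1 else "meta_data.csv"
    return sitemap_url, output_file
```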

Output Format

The script generates a CSV file with the following columns:

  • URL
  • Meta Title
  • Meta Description

Values are enclosed in quotes when necessary, following CSV best practices.
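Python's standard csv module produces exactly this behavior with its default minimal quoting: fields are wrapped in quotes only when they contain a delimiter, quote character, or newline. A minimal sketch (the helper `write_rows` is hypothetical; the actual scraper.py may differ):

```python
import csv
import io

# Hypothetical helper illustrating the quoting behavior; csv.writer's default
# QUOTE_MINIMAL mode quotes a field only when it needs it.
def write_rows(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["URL", "Meta Title", "Meta Description"])
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

print(write_rows([
    ("https://example.com/", "Example", "Plain description"),
    ("https://example.com/a", "Page A", "Hello, world"),  # comma forces quoting
]))
```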

How It Works

  1. The script fetches the provided sitemap URL
  2. It parses the XML to extract all page URLs
  3. For each URL found in the sitemap:
    • The page is downloaded
    • The meta title is extracted from the <title> tag
    • The meta description is extracted from the <meta name="description"> tag
  4. All data is written to a CSV file
  5. Progress is displayed with a progress bar and emoji-decorated log messages
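
Steps 2-3 above can be sketched with the listed dependencies. The helper names `parse_sitemap` and `extract_meta` are hypothetical, and the real scraper.py may structure this differently:

```python
from xml.etree import ElementTree
from bs4 import BeautifulSoup

# Standard sitemap XML namespace (sitemaps.org protocol)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Step 2: extract all <loc> page URLs from the sitemap XML."""
    root = ElementTree.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

def extract_meta(html):
    """Step 3: return (title, description), empty strings when missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    tag = soup.find("meta", attrs={"name": "description"})
    description = tag.get("content", "") if tag else ""
    return title, description
```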

Performance Notes

  • The script includes a small delay between requests to avoid overwhelming servers
  • If a page can't be fetched or parsed, the script continues with empty values for that URL
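
Both behaviors can be captured in a small fetch helper like the sketch below. The function `fetch_page` and its defaults are assumptions for illustration, not the scraper's actual code:

```python
import time
import requests

# Hypothetical polite-fetch helper; scraper.py's actual implementation may differ.
def fetch_page(url, delay=0.5, timeout=10):
    """Sleep briefly, then fetch; return '' on any failure so the scrape continues."""
    time.sleep(delay)  # small delay between requests to avoid overwhelming servers
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return ""  # failed page: its CSV row gets empty title/description
```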