
Sitemap Meta Scraper

A Python script that extracts meta titles and descriptions from all URLs in a sitemap and saves them to a CSV file.

Features

  • 🌐 Fetches and parses XML sitemaps
  • 📊 Extracts meta titles and descriptions from each URL
  • 💾 Outputs data to a CSV file with proper formatting
  • 📝 Writes data to CSV as it processes each URL (not just at the end)
  • 🚀 Shows progress with a progress bar
  • 🎉 Logs operations with fun emojis

Requirements

This script uses Nix to manage dependencies. The following Python packages are used:

  • requests
  • beautifulsoup4
  • lxml
  • tqdm
  • colorama
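
One way to provide these dependencies is an ad-hoc nix-shell invocation like the one below. This is a sketch only: the repository may instead ship its own shell.nix or use a nix-shell shebang inside scraper.py, so treat the exact command as an assumption.

```shell
# Hypothetical invocation; assumes the script is run from the repo root.
nix-shell -p 'python3.withPackages (ps: [ ps.requests ps.beautifulsoup4 ps.lxml ps.tqdm ps.colorama ])' \
  --run './scraper.py https://example.com/sitemap.xml'
```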

Usage

Run the script from the command line with a sitemap URL and optional output filename:

./scraper.py SITEMAP_URL [OUTPUT_FILE]

Examples

# Basic usage with default output filename (meta_data.csv)
./scraper.py https://example.com/sitemap.xml

# Specify a custom output filename
./scraper.py https://example.com/sitemap.xml my_data.csv
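The argument handling implied by the usage line can be sketched as follows. The helper name `parse_args` is hypothetical; the actual scraper.py may handle its arguments differently (e.g. with argparse).

```python
import sys

# Hypothetical argument handling matching the usage line above;
# scraper.py's actual implementation may differ.
def parse_args(argv):
    """Return (sitemap_url, output_file), defaulting the output filename."""
    if not argv:
        sys.exit("Usage: ./scraper.py SITEMAP_URL [OUTPUT_FILE]")
    sitemap_url = argv[0]
    # Fall back to the default output filename when none is given
    output_file = argv[1] if len(argv) > 1 else "meta_data.csv"
    return sitemap_url, output_file
```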

Output Format

The script generates a CSV file with the following columns:

  • URL
  • Meta Title
  • Meta Description

Values are enclosed in quotes when necessary, following CSV best practices.
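Python's standard csv module produces exactly this behavior with its default minimal quoting: fields are wrapped in quotes only when they contain a delimiter, quote character, or newline. A minimal sketch (the helper `write_rows` is hypothetical; the actual scraper.py may differ):

```python
import csv
import io

# Hypothetical helper illustrating the quoting behavior; csv.writer's default
# QUOTE_MINIMAL mode quotes a field only when it needs it.
def write_rows(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["URL", "Meta Title", "Meta Description"])
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

print(write_rows([
    ("https://example.com/", "Example", "Plain description"),
    ("https://example.com/a", "Page A", "Hello, world"),  # comma forces quoting
]))
```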

How It Works

  1. The script fetches the provided sitemap URL
  2. It parses the XML to extract all page URLs
  3. For each URL found in the sitemap:
    • The page is downloaded
    • The meta title is extracted from the <title> tag
    • The meta description is extracted from the <meta name="description"> tag
  4. All data is written to a CSV file
  5. Progress is displayed with a progress bar and emoji-decorated log messages
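
Steps 2-3 above can be sketched with the listed dependencies. The helper names `parse_sitemap` and `extract_meta` are hypothetical, and the real scraper.py may structure this differently:

```python
from xml.etree import ElementTree
from bs4 import BeautifulSoup

# Standard sitemap XML namespace (sitemaps.org protocol)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Step 2: extract all <loc> page URLs from the sitemap XML."""
    root = ElementTree.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

def extract_meta(html):
    """Step 3: return (title, description), empty strings when missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    tag = soup.find("meta", attrs={"name": "description"})
    description = tag.get("content", "") if tag else ""
    return title, description
```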

Performance Notes

  • The script includes a small delay between requests to avoid overwhelming servers
  • If a page can't be fetched or parsed, the script continues with empty values for that URL
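
Both behaviors can be captured in a small fetch helper like the sketch below. The function `fetch_page` and its defaults are assumptions for illustration, not the scraper's actual code:

```python
import time
import requests

# Hypothetical polite-fetch helper; scraper.py's actual implementation may differ.
def fetch_page(url, delay=0.5, timeout=10):
    """Sleep briefly, then fetch; return '' on any failure so the scrape continues."""
    time.sleep(delay)  # small delay between requests to avoid overwhelming servers
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return ""  # failed page: its CSV row gets empty title/description
```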