GitHub Scraper

Generate professional CVs from GitHub profiles with comprehensive data scraping and analytics.

🚀 Features

Core Functionality

📄 Markdown Output - Professional, GitHub-compatible markdown with rich data analysis
🎯 Data-Focused - Comprehensive data scraping and insights without presentation complexity
📊 Activity Scoring - Comprehensive scoring based on stars, forks, and activity
💻 Language Analysis - Programming language expertise with proficiency levels
🚀 Project Showcase - Smart repository selection with diversity analysis

Website Enrichment

📝 Personal Website Integration - Automatically discovers and analyzes personal websites
🔍 Smart URL Discovery - Finds websites from GitHub profile, bio, and repositories
🤖 Firecrawl Integration - Uses advanced web scraping with AI-powered extraction
📊 Enhanced Profiles - Adds skills, experience, and projects from personal websites
🛡️ Intelligent Filtering - Focuses on personal websites, filters out social media

Deeper GitHub Signals

🔍 PR Review Analytics - Comprehensive pull request review analysis and approval ratios
🎯 Issue Engagement - Track issues opened, closed, and comment participation
💬 Discussion Activity - Repository discussions started and engagement metrics
📋 Project Management - Project board item creation and management activity
💰 Sponsorship Status - GitHub Sponsors enablement and funding opportunities

LinkedIn Professional Enrichment

📎 Smart Profile Discovery - Automatically finds LinkedIn profiles from GitHub bio and repositories
🎓 Education Background - Extracts degrees, institutions, and academic history
🏆 Professional Certifications - Captures certifications with issuing organizations and dates
💼 Work Experience - Professional positions, companies, and career progression
🔗 Data Correlation - Cross-references LinkedIn and GitHub data for consistency
📊 Professional Analytics - Enhanced archetype analysis with formal credentials

Technical Features

⚡ Intelligent Caching - API response caching for faster subsequent runs
🎛️ Configurable Options - Extensive customization via config files
🔄 Rate Limit Friendly - Optimized API usage with GitHub token support
🔍 Deep Analysis - Comprehensive profile and repository analysis
📁 Organized Output - Automatic folder organization by user and date

📦 Installation

🚀 Quick Setup with Environment Variables (Recommended)

Clone and setup:

git clone https://github.com/nikosmav/github-scraper.git
cd github-scraper
python setup.py  # Creates venv, installs dependencies, sets up .env

Configure API keys:

# Edit the .env file with your API keys
cp env.example .env
# Then edit .env and add your tokens

Run with automatic environment loading:

# Windows
run.bat build username

# Unix/Linux/macOS
./run.sh build username

📁 Manual Setup (Alternative)

git clone https://github.com/nikosmav/github-scraper.git
cd github-scraper
pip install -e ".[dev,all]"

🔧 Environment Setup

🔑 API Keys Configuration

Create a .env file in the project root with your API keys:

# Copy the example file
cp env.example .env

# Edit .env and add your keys:
# GITHUB_TOKEN=your_github_token_here
# FIRECRAWL_API_KEY=your_firecrawl_api_key_here

Required API Keys:

GitHub Token (optional but recommended): Higher rate limits (5,000 vs 60 requests/hour)
- Get from: GitHub Settings > Tokens
- Required scopes: public_repo, user (and repo for private repos)
- Environment variables: GITHUB_TOKEN or GH_TOKEN
Firecrawl API Key (for website enrichment): Website scraping and LinkedIn data
- Get from: Firecrawl (free tier available)
- Required for: --enrich-websites, --include-linkedin, --full-profile
- Environment variable: FIRECRAWL_API_KEY

🖥️ Automatic Environment Loading

Use the provided runner scripts that automatically load your .env file:

# Windows PowerShell/Command Prompt
run.bat build username --full-profile

# Unix/Linux/macOS Terminal
./run.sh build username --full-profile

These scripts will:

✅ Activate the Python virtual environment
✅ Load environment variables from .env
✅ Run the program with all API keys ready
✅ Show helpful error messages if setup is incomplete

🛠️ Manual Environment Loading

If you prefer manual control:

# Activate virtual environment
source .venv/bin/activate  # Unix/Linux/macOS
# or
.venv\Scripts\activate.bat  # Windows

# Load environment variables
export $(cat .env | xargs)  # Unix/Linux/macOS
# or manually set in Windows PowerShell:
# $env:GITHUB_TOKEN="your_token_here"
# $env:FIRECRAWL_API_KEY="your_key_here"

# Run the program
github-scraper build username --full-profile

🚀 Quick Start

With .env Setup (Recommended)

Once you've set up your .env file with API keys:

# Windows - Full profile with all features
run.bat build username --full-profile

# Unix/Linux/macOS - Full profile with all features
./run.sh build username --full-profile

# Basic usage (no API keys needed)
./run.sh build username

# With specific features
./run.sh build username --enrich-websites --include-linkedin

Manual Usage (Alternative)

# Generate comprehensive CV with data analysis
github-scraper build username

# Generate with website enrichment (requires Firecrawl API key)
github-scraper build username --enrich-websites --verbose

# Generate with deeper GitHub signals (requires GitHub token)
github-scraper build username --include-deeper-signals --token YOUR_GITHUB_TOKEN

# Generate with LinkedIn professional data (requires Firecrawl API key)
github-scraper build username --include-linkedin --verbose

# Generate full profile with all features
github-scraper build username --full-profile --token YOUR_GITHUB_TOKEN

# Generate with GitHub token for higher rate limits
github-scraper build username --token YOUR_GITHUB_TOKEN

Note: With .env setup, you don't need to pass --token manually - it's automatically loaded!

🌐 Website Enrichment Setup

Website enrichment requires a Firecrawl API key:

Get API Key: Sign up at Firecrawl for a free API key
Set Environment Variable:

# Linux/macOS
export FIRECRAWL_API_KEY="your_firecrawl_api_key_here"

# Windows
set FIRECRAWL_API_KEY=your_firecrawl_api_key_here

Use with enrichment:

github-scraper build username --enrich-websites --verbose

🔍 Deeper GitHub Signals Setup

Deeper GitHub signals require a GitHub Personal Access Token for GraphQL API access:

Get GitHub Token: Go to GitHub Settings > Tokens
Create Token: Generate a new token with repo and user scopes
Use with deeper signals:

github-scraper build username --include-deeper-signals --token YOUR_GITHUB_TOKEN

🔗 LinkedIn Enrichment Setup

LinkedIn enrichment also uses the Firecrawl API for reliable data extraction:

API Key Required: Uses the same Firecrawl API key as website enrichment
Set Environment Variable (if not already set):

# Linux/macOS
export FIRECRAWL_API_KEY="your_firecrawl_api_key_here"

# Windows
set FIRECRAWL_API_KEY=your_firecrawl_api_key_here

Use with LinkedIn enrichment:

github-scraper build username --include-linkedin --verbose

Note: LinkedIn profiles must be publicly accessible and linked from the GitHub profile (bio, blog field, or repository descriptions). Supports multiple URL formats:

Full URLs: https://linkedin.com/in/username
Domain URLs: linkedin.com/in/username
Partial URLs: in/username (like in/nikolaos-mavrapidis)

🚀 Full Profile Mode

Full profile mode includes everything (website + LinkedIn + deeper signals):

github-scraper build username --full-profile --token YOUR_GITHUB_TOKEN

Advanced Usage

# Custom output directory
github-scraper build username --output-dir team_profiles

# Single file output (bypasses organization)
github-scraper build username --output resume.md

# Force fresh download, bypass all caches
github-scraper build username --no-cache
# or use the refresh flag (same effect)
github-scraper build username --refresh

# Clean up old generated files
github-scraper clean --days 7

📂 Output Organization

By default, files are organized in structured folders:

generated_cvs/
├── username_2024-01-15/
│   └── username_comprehensive_cv.md
└── another_user_2024-01-16/
    └── another_user_comprehensive_cv.md

Organization Options:

Default: generated_cvs/{username}_{date}/
Custom directory: --output-dir custom_folder/
Single file: --output filename.md (bypasses organization)

🗂️ Caching System

The tool uses intelligent caching to speed up subsequent runs:

GitHub Profile Data: Cached for 1 hour in ~/.cache/github-scraper/
Website Content: Cached for 24 hours in ~/.cache/github-scraper/websites/
Deeper Signals: Cached for 2 hours in ~/.cache/github-scraper/

Cache Control:

# Use cache (default behavior)
github-scraper build username

# Force fresh download, bypass all caches
github-scraper build username --no-cache
github-scraper build username --refresh  # Same as --no-cache

⚙️ Configuration

Create a github-scraper.json file:

{
  "github": {
    "cache_duration_hours": 2,
    "max_repos": 20,
    "include_forks": false
  },
  "cv": {
    "max_featured_repos": 15,
    "include_insights": true,
    "activity_threshold_days": 90
  },
  "output": {
    "include_timestamp": true
  },
  "scraping": {
    "enable_website_enrichment": true,
    "max_websites_per_profile": 5
  }
}

📋 CLI Commands

`build` - Generate CV

github-scraper build <username> [OPTIONS]

Options:
  -o, --output PATH          Output file path
  -d, --output-dir PATH      Output directory for organized files
  --enrich-websites          Enable website enrichment
  --include-deeper-signals   Include deeper GitHub signals (PR reviews, issues, discussions, projects)
  --include-linkedin         Include LinkedIn professional profile data (headline, education, certifications)
  --full-profile             Include all features: website enrichment + deeper signals + LinkedIn
  --token TEXT               GitHub personal access token
  -c, --config PATH          Configuration file path
  --cache/--no-cache         Enable/disable caching (GitHub profiles cached 1h, websites 24h)
  --refresh                  Force fresh download, bypass all caches (same as --no-cache)
  -v, --verbose              Enable verbose output

Other Commands

# Clean up old files
github-scraper clean [username] --days 7 --yes

# Show configuration help
github-scraper config

# System diagnostic
github-scraper doctor

🔑 GitHub API Setup

Rate Limits:

Without token: 60 requests/hour (testing only)
With token: 5,000 requests/hour (recommended)

Setup:

Go to GitHub Settings > Tokens
Generate a new token with public_repo scope
Use: github-scraper build username --token YOUR_TOKEN

🎯 Output Content

Generated CVs include:

Data-Driven Content

Profile Overview: Enhanced with website and LinkedIn data when available
Technical Skills: Programming languages with expertise levels
Professional Credentials: Education background and certifications from LinkedIn
Featured Projects: Intelligently selected repositories with metrics
Activity Analysis: Contribution patterns and engagement metrics

Analytics & Insights

Activity Scoring: Based on stars, forks, and recent activity
Language Proficiency: Calculated from repository data
Professional Archetype: Enhanced analysis including formal education and certifications
Contribution Patterns: Community engagement analysis
Repository Trends: Creation and maintenance insights
Data Correlation: Cross-platform consistency analysis (GitHub vs LinkedIn)

🔧 Development

Running Tests

pip install -e ".[dev]"
pytest
pytest --cov=github_scraper --cov-report=html

Code Quality

black src/ tests/        # Format code
ruff check src/ tests/    # Lint code
mypy src/                 # Type checking

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

PyGithub for GitHub API integration
Jinja2 for template rendering
Typer for the CLI interface
Firecrawl for website enrichment capabilities

Generate professional CVs from GitHub profiles with comprehensive data analysis! ✨

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.github/workflows		.github/workflows
src/github_scraper		src/github_scraper
tests		tests
website_content		website_content
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
README.md		README.md
pyproject.toml		pyproject.toml
run.bat		run.bat
run.sh		run.sh
setup.py		setup.py

NikosMav/github-scraper

Folders and files

Latest commit

History

Repository files navigation