Telegram analytics toolkit

Scrape public Telegram channel posts, fetch replies and forwards, and generate rich analytics reports with tagging, sentiment, matchday overlays, and plots. Use VADER, a local transformers model, or a local GPT (Ollama) backend for sentiment.

Highlights:

  • Fast replies scraping with concurrency, resume/append, and rate-limit visibility
  • Forwards scanning with chunked, concurrent search
  • Analyzer: tagging from YAML keywords; sentiment via VADER, transformers, or local GPT; emoji-aware modes; combined posts+replies metrics; and matchday cross-analysis
  • Plots: daily activity with in-plot match labels, daily volume vs sentiment (new), heatmaps, and per-tag (team) sentiment shares
  • Local learning: fine-tune and evaluate a transformers classifier and use it in analysis

Full command reference is in docs/COMMANDS.md.

Quick start

  1. Configure secrets in .env (script will prompt if absent):
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
  2. Run any command via the wrapper (creates venv and installs deps automatically):
# Fetch messages to CSV
./run_scraper.sh scrape -c https://t.me/Premier_League_Update -o data/premier_league_update.csv --start-date 2025-08-15 --end-date 2025-10-15

# Fetch replies fast
./run_scraper.sh replies -c https://t.me/Premier_League_Update --from-csv data/premier_league_update.csv -o data/premier_league_replies.csv --min-replies 1 --concurrency 15 --resume --append

# Analyze with tags, fixtures, emoji handling and plots
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv --tags-config config/tags.yaml --write-augmented-csv --write-combined-csv --emoji-mode keep --emoji-boost --save-plots
  3. Use transformers sentiment instead of VADER:
# Off-the-shelf fine-tuned sentiment head
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
  4. Use a local GPT backend (Ollama) for sentiment (JSON labels+confidence mapped to a compound score):
# Ensure Ollama is running locally and the model is available (e.g., llama3)
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
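
The exact schema of config/tags.yaml is not shown in this README; a plausible keyword-to-tag mapping, consistent with "tagging from YAML keywords" and the club_ tag prefix used by the per-team sentiment plots, might look like this (hypothetical — adjust to the schema the analyzer actually expects):

```yaml
# Hypothetical config/tags.yaml structure. Tags starting with "club_"
# would feed the per-tag (team) sentiment share plots.
club_arsenal:
  - arsenal
  - gunners
club_liverpool:
  - liverpool
  - lfc
referee:
  - var
  - referee
```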

Aliases

Convenient zsh functions live in scripts/aliases.zsh:

  • fast_replies — resume+append replies with concurrency
  • chunked_forwards — concurrent forwards scan
  • analyze_combined — posts+replies+fixtures with tags
  • analyze_emoji — emoji-aware analyze with boost
  • analyze_transformers — analyze with transformers and export details
  • apply_labels_and_analyze — merge a labeled CSV into posts/replies and run analyzer (reuses sentiment_label)
  • plot_labeled — QA plots from a labeled CSV (class distribution, confidence, lengths)
  • train_transformers — fine-tune a model on a labeled CSV
  • eval_transformers — evaluate a fine-tuned model

Source them:

source scripts/aliases.zsh

Local transformers (optional)

Train a classifier:

./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
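
The trainer reads the columns named by --text-col and --label-col; a minimal data/labeled_sentiment.csv might look like this (rows are purely illustrative):

```
message,label
Great comeback today!,positive
Terrible defending again,negative
Lineups are out,neutral
```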

Evaluate it:

./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
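
The eval module's exact report isn't documented here; as a sketch of the kind of metrics such an evaluation typically computes, here are accuracy and macro-F1 over gold vs predicted labels in pure Python (no project code assumed):

```python
def accuracy(gold, pred):
    # Fraction of rows where the predicted label matches the gold label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels=("negative", "neutral", "positive")):
    # Unweighted mean of per-class F1 scores (robust to class imbalance).
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```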

Use it in analyze:

./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots

Notes:

  • GPU/Apple Silicon (MPS) is auto-detected; CPU is the fallback.
  • Torch pinning in requirements.txt uses conditional versions for smooth installs across Python versions.

Plots produced (when --save-plots is used)

  • daily_activity_stacked.png — stacked bar chart of posts vs replies per day.
    • Dynamic sizing: --plot-width-scale, --plot-max-width, --plot-height
    • Top-N highlights: --activity-top-n (labels show total and posts+replies breakdown)
    • Match labels inside the plot using team abbreviations; control density with:
      • --labels-max-per-day, --labels-per-line, --labels-stagger-rows, --labels-band-y, --labels-annotate-mode
  • daily_volume_and_sentiment.png — total volume (posts+replies) per day as bars (left Y) and positive%/negative% as lines (right Y). Uses sentiment_label when present, otherwise sentiment_compound thresholds.
  • posts_heatmap_hour_dow.png — heatmap of posts activity by hour and day-of-week.
  • sentiment_by_tag_posts.png — stacked shares of pos/neu/neg by team tag (tags starting with club_), with dynamic width.
  • Matchday rollups (when fixtures are provided):
    • matchday_sentiment_overall.csv — per-fixture-day aggregates for posts (and replies when provided)
    • matchday_sentiment_overall.png — mean sentiment time series on matchdays (posts, replies)
    • matchday_posts_volume_vs_sentiment.png — scatter of posts volume vs mean sentiment on matchdays
  • Diagnostics:
    • match_labels_debug.csv — per-day list of rendered match labels (helps tune label density)

Tip: The analyzer adapts plot width to the number of days; for very long ranges, raise --plot-max-width.
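
To make the daily_volume_and_sentiment logic concrete, here is a sketch of how per-day positive%/negative% can be derived, preferring sentiment_label and falling back to sentiment_compound thresholds as described above (the ±0.05 thresholds are an assumption; the analyzer's actual values may differ):

```python
POS_T, NEG_T = 0.05, -0.05  # assumed compound thresholds

def row_label(row):
    # Prefer an explicit sentiment_label; otherwise fall back to thresholds
    # on sentiment_compound.
    label = row.get("sentiment_label")
    if label:
        return label
    c = row["sentiment_compound"]
    return "positive" if c >= POS_T else "negative" if c <= NEG_T else "neutral"

def daily_shares(rows):
    # rows: list of dicts for one day -> (positive %, negative %)
    labels = [row_label(r) for r in rows]
    n = len(labels)
    return 100.0 * labels.count("positive") / n, 100.0 * labels.count("negative") / n
```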

Plot sizing and label flags (analyze)

  • --plot-width-scale (default 0.8): inches of width per day for the daily charts.
  • --plot-max-width (default 104): cap on width in inches.
  • --plot-height (default 6.5): figure height in inches.
  • --activity-top-n (default 5): highlight top-N activity days; 0 disables.
  • Match label controls:
    • --labels-max-per-day (default 3): cap on labels per day; overflow collapses to "+N more".
    • --labels-per-line (default 2): labels per line in the band.
    • --labels-band-y (default 0.96): vertical position of the band (axes coords).
    • --labels-stagger-rows (default 2): stagger rows to reduce collisions.
    • --labels-annotate-mode (ticks|all|ticks+top): which x positions get labels.
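
The width adaptation described above amounts to scaling by the day count and capping at the maximum; a minimal sketch (the 8-inch floor is an assumption, not a documented default):

```python
def figure_width(n_days, width_scale=0.8, max_width=104.0):
    # Width grows with the number of days, capped at max_width;
    # defaults mirror --plot-width-scale and --plot-max-width.
    return min(max_width, max(8.0, n_days * width_scale))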

Automatic labeling (no manual annotation)

If you don't want to label data by hand, generate a labeled training set automatically and train a local model.

Label with VADER (fast) or a pretrained transformers model (higher quality):

# Load aliases
source scripts/aliases.zsh

# VADER: keeps only confident predictions by default
auto_label_vader

# Or Transformers: CardiffNLP 3-class sentiment (keeps confident only)
auto_label_transformers

# Output: data/labeled_sentiment.csv (message, label, confidence, ...)

Then fine-tune a classifier on the generated labels and use it in analysis:

# Train on the auto-labeled CSV
train_transformers

# Analyze using your fine-tuned model
./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots

Advanced knobs (optional):

  • VADER thresholds: --vader-pos 0.05 --vader-neg -0.05 --vader-margin 0.2
  • Transformers acceptance: --min-prob 0.6 --min-margin 0.2
  • Keep all predictions (not just confident): remove --only-confident
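
One plausible reading of these acceptance knobs, sketched in pure Python (the assumed semantics — margin on top of the VADER thresholds, and top-probability plus runner-up gap for transformers — may differ from the actual implementation):

```python
def vader_confident(compound, pos=0.05, neg=-0.05, margin=0.2):
    # Classify via the pos/neg thresholds, then treat the row as confident
    # only if the compound clears the threshold by at least `margin`.
    # Neutral rows are assumed always kept.
    if compound >= pos:
        return "positive", compound >= pos + margin
    if compound <= neg:
        return "negative", compound <= neg - margin
    return "neutral", True

def transformers_confident(probs, min_prob=0.6, min_margin=0.2):
    # probs: {label: probability}. Keep only if the top class is both
    # probable enough and sufficiently ahead of the runner-up.
    top, second = sorted(probs.values(), reverse=True)[:2]
    return max(probs, key=probs.get), top >= min_prob and (top - second) >= min_margin
```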

Local GPT backend (Ollama)

You can use a local GPT model for sentiment. The analyzer requests strict JSON {label, confidence} and maps it to a compound score. If the GPT call fails for any rows, it gracefully falls back to VADER for those rows.
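
One plausible label-to-compound mapping consistent with that description — sign from the label, magnitude from the confidence — is sketched below (hypothetical; the analyzer's actual formula may differ):

```python
def gpt_to_compound(label, confidence):
    # Map a strict-JSON {label, confidence} reply to a VADER-like
    # compound score in [-1, 1].
    sign = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}[label.lower()]
    return sign * float(confidence)
```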

Example:

./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots

License

MIT (adjust as needed)

Description
English Premier League Updates - A Telegram Public Channel Data Analysis