# Telegram analytics toolkit

Scrape public Telegram channel posts, fetch replies and forwards, and generate rich analytics reports with tagging, sentiment, matchday overlays, and plots. Use VADER, a local transformers model, or a local GPT (Ollama) backend for sentiment.

Highlights:
- Fast replies scraping with concurrency, resume/append, and rate-limit visibility
- Forwards scanning with chunked, concurrent search
- Analyzer: tagging from YAML keywords; sentiment via VADER, transformers, or local GPT; emoji-aware modes; combined posts+replies metrics; and matchday cross-analysis
- Plots: daily activity with in-plot match labels, daily volume vs sentiment (new), heatmaps, and per-tag (team) sentiment shares
- Local learning: fine-tune and evaluate a transformers classifier and use it in analysis

Full command reference is in `docs/COMMANDS.md`.

## Quick start

1) Configure secrets in `.env` (the script will prompt for them if absent):
```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```

2) Run any command via the wrapper (it creates a virtual environment and installs dependencies automatically):

```zsh
# Fetch messages to CSV
./run_scraper.sh scrape -c https://t.me/Premier_League_Update -o data/premier_league_update.csv --start-date 2025-08-15 --end-date 2025-10-15

# Fetch replies fast
./run_scraper.sh replies -c https://t.me/Premier_League_Update --from-csv data/premier_league_update.csv -o data/premier_league_replies.csv --min-replies 1 --concurrency 15 --resume --append

# Analyze with tags, fixtures, emoji handling and plots
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv --tags-config config/tags.yaml --write-augmented-csv --write-combined-csv --emoji-mode keep --emoji-boost --save-plots
```

3) Use transformers sentiment instead of VADER:

```zsh
# Off-the-shelf fine-tuned sentiment head
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

4) Use a local GPT backend (Ollama) for sentiment (JSON labels+confidence mapped to a compound score):

```zsh
# Ensure Ollama is running locally and the model is available (e.g., llama3)
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

## Aliases

Convenient zsh functions live in `scripts/aliases.zsh`:

- `fast_replies` — resume+append replies with concurrency
- `chunked_forwards` — concurrent forwards scan
- `analyze_combined` — posts+replies+fixtures with tags
- `analyze_emoji` — emoji-aware analyze with boost
- `analyze_transformers` — analyze with transformers and export details
- `apply_labels_and_analyze` — merge a labeled CSV into posts/replies and run analyzer (reuses sentiment_label)
- `plot_labeled` — QA plots from a labeled CSV (class distribution, confidence, lengths)
- `train_transformers` — fine-tune a model on a labeled CSV
- `eval_transformers` — evaluate a fine-tuned model

Source them:
```zsh
source scripts/aliases.zsh
```

## Local transformers (optional)

Train a classifier:
```zsh
./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
```

Evaluate it:
```zsh
./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
```

Use it in analyze:
```zsh
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

Notes:
- GPU/Apple Silicon (MPS) is auto-detected; CPU is the fallback.
- The `torch` pin in `requirements.txt` uses conditional version specifiers so that installation works across Python versions.
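
The device auto-detection noted above could look roughly like the sketch below. This is an illustration, not the toolkit's actual code: `pick_device` and `detect_device` are hypothetical names, and the CUDA, then MPS, then CPU preference order is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a torch device name, preferring CUDA, then MPS, then CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe torch for accelerators; fall back to CPU when torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older torch builds may lack the MPS backend module, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

Keeping the decision in a pure function (`pick_device`) makes the preference order easy to test without a GPU.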

## Plots produced (when --save-plots is used)

- `daily_activity_stacked.png` — stacked bar chart of posts vs replies per day.
  - Dynamic sizing: `--plot-width-scale`, `--plot-max-width`, `--plot-height`
  - Top-N highlights: `--activity-top-n` (labels show total and posts+replies breakdown)
  - Match labels inside the plot using team abbreviations; control density with:
    - `--labels-max-per-day`, `--labels-per-line`, `--labels-stagger-rows`, `--labels-band-y`, `--labels-annotate-mode`
- `daily_volume_and_sentiment.png` — total volume (posts+replies) per day as bars (left Y) and positive%/negative% as lines (right Y). Uses `sentiment_label` when present, otherwise `sentiment_compound` thresholds.
- `posts_heatmap_hour_dow.png` — heatmap of posts activity by hour and day-of-week.
- `sentiment_by_tag_posts.png` — stacked shares of pos/neu/neg by team tag (tags starting with `club_`), with dynamic width.
- Matchday rollups (when fixtures are provided):
  - `matchday_sentiment_overall.csv` — per-fixture-day aggregates for posts (and replies when provided)
  - `matchday_sentiment_overall.png` — mean sentiment time series on matchdays (posts, replies)
  - `matchday_posts_volume_vs_sentiment.png` — scatter of posts volume vs mean sentiment on matchdays
- Diagnostics:
  - `match_labels_debug.csv` — per-day list of rendered match labels (helps tune label density)

Tip: The analyzer adapts plot width to the number of days; for very long ranges, raise `--plot-max-width`.
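
As a rough illustration of how `daily_volume_and_sentiment.png` could derive its series, the sketch below prefers `sentiment_label` and falls back to thresholds on `sentiment_compound`. The `date` column name, the accepted label spellings, and the ±0.05 cutoffs are assumptions made for the example; the analyzer's actual logic may differ.

```python
from collections import defaultdict

POS_THRESHOLD = 0.05   # assumed cutoff (the conventional VADER value)
NEG_THRESHOLD = -0.05  # assumed cutoff (the conventional VADER value)


def row_label(row: dict) -> str:
    """Return 'pos', 'neu', or 'neg', preferring an explicit sentiment_label."""
    label = str(row.get("sentiment_label") or "").strip().lower()
    for short in ("pos", "neu", "neg"):
        if label.startswith(short):  # accepts 'pos' as well as 'positive'
            return short
    compound = float(row.get("sentiment_compound") or 0.0)
    if compound >= POS_THRESHOLD:
        return "pos"
    if compound <= NEG_THRESHOLD:
        return "neg"
    return "neu"


def daily_shares(rows: list) -> dict:
    """Aggregate rows (posts and replies) into per-day volume and pos/neg percentages."""
    counts = defaultdict(lambda: {"pos": 0, "neu": 0, "neg": 0})
    for row in rows:
        counts[row["date"]][row_label(row)] += 1  # 'date' is a hypothetical column
    shares = {}
    for day, c in sorted(counts.items()):
        total = sum(c.values())
        shares[day] = {
            "volume": total,
            "positive_pct": 100.0 * c["pos"] / total,
            "negative_pct": 100.0 * c["neg"] / total,
        }
    return shares
```

In the plot, `volume` would feed the bars on the left axis and the two percentages the lines on the right axis.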

## Plot sizing and label flags (analyze)

- `--plot-width-scale` (default 0.8): width of the daily charts, in inches per day.
- `--plot-max-width` (default 104): cap on width in inches.
- `--plot-height` (default 6.5): figure height in inches.
- `--activity-top-n` (default 5): highlight top-N activity days; 0 disables.
- Match label controls:
  - `--labels-max-per-day` (default 3): maximum labels shown per day; any overflow is summarized as `+N more`.
  - `--labels-per-line` (default 2): labels per line in the band.
  - `--labels-band-y` (default 0.96): vertical position of the band (axes coords).
  - `--labels-stagger-rows` (default 2): stagger rows to reduce collisions.
  - `--labels-annotate-mode` (ticks|all|ticks+top): which x positions get labels.
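
To see how the sizing flags interact, here is a minimal sketch of a width calculation that is consistent with the defaults above. The function name and the `min_width` floor are hypothetical; only the scale, cap, and height defaults come from this README.

```python
def figure_size(n_days: int,
                width_scale: float = 0.8,
                max_width: float = 104.0,
                height: float = 6.5,
                min_width: float = 8.0) -> tuple:
    """Return (width, height) in inches for a daily chart spanning n_days.

    Width grows linearly with the number of days and is clamped to
    [min_width, max_width]. min_width is an assumption for illustration.
    """
    width = min(max(n_days * width_scale, min_width), max_width)
    return (width, height)
```

Under these assumptions, the 62-day range used in the quick start gives a width of about 49.6 inches, while a full year (365 days) would hit the 104-inch cap unless `--plot-max-width` is raised.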

## Automatic labeling (no manual annotation)

If you don't want to label data by hand, generate a labeled training set automatically and train a local model.

Label with VADER (fast) or a pretrained transformers model (higher quality):

```zsh
# Load aliases
source scripts/aliases.zsh

# VADER: keeps only confident predictions by default
auto_label_vader

# Or Transformers: CardiffNLP 3-class sentiment (keeps confident only)
auto_label_transformers

# Output: data/labeled_sentiment.csv (message, label, confidence, ...)
```

Then fine-tune a classifier on the generated labels and use it in analysis:

```zsh
# Train on the auto-labeled CSV
train_transformers

# Analyze using your fine-tuned model
./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

Advanced knobs (optional):
- VADER thresholds: `--vader-pos 0.05 --vader-neg -0.05 --vader-margin 0.2`
- Transformers acceptance: `--min-prob 0.6 --min-margin 0.2`
- Keep all predictions (not just confident): remove `--only-confident`
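
One plausible reading of the `--min-prob` and `--min-margin` knobs is sketched below: a prediction is kept only when the top class probability is high enough and clearly ahead of the runner-up. This interpretation and the function name `accept_prediction` are assumptions, not taken from the labeling script itself.

```python
def accept_prediction(probs: dict, min_prob: float = 0.6, min_margin: float = 0.2):
    """Return (label, probability) for a confident prediction, else None.

    probs maps class labels to probabilities, e.g. {"pos": 0.7, "neu": 0.2, "neg": 0.1}.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_p = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Both conditions must hold: absolute confidence and separation from second place.
    if top_p >= min_prob and (top_p - runner_up) >= min_margin:
        return top_label, top_p
    return None
```

Rows for which this returns `None` would be dropped when only confident predictions are kept.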

## Local GPT backend (Ollama)

You can use a local GPT model for sentiment. The analyzer requests strict JSON of the form `{label, confidence}` and maps it to a compound score. If the GPT call fails for a row, the analyzer falls back to VADER for that row.

Example:
```zsh
./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```
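
The mapping from a `{label, confidence}` reply to a compound score, with a per-row VADER fallback, might be sketched as follows. The label vocabulary, the signed-confidence formula, and both function names are assumptions made for illustration; the analyzer's real mapping may differ.

```python
import json

# Assumed label vocabulary and sign convention.
SIGN = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def gpt_reply_to_compound(reply_text):
    """Map a strict-JSON reply to a score in [-1, 1]; return None if it is unusable."""
    try:
        payload = json.loads(reply_text)
        label = str(payload["label"]).strip().lower()
        confidence = float(payload["confidence"])
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON, missing keys, or a non-numeric confidence
    if label not in SIGN:
        return None
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return SIGN[label] * confidence


def score_row(reply_text, vader_compound: float) -> float:
    """Use the GPT-derived score when available, otherwise the VADER score for that row."""
    compound = gpt_reply_to_compound(reply_text)
    return vader_compound if compound is None else compound
```

Treating any unparseable reply as a miss keeps a single bad generation from failing the whole run.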

## License
MIT (adjust as needed)