chore(repo): initialize git with .gitignore, .gitattributes, and project sources
.gitattributes (vendored, new file)
@@ -0,0 +1 @@
* text=auto eol=lf
.gitignore (vendored, new file)
@@ -0,0 +1,61 @@
# OS / Editor
.DS_Store
.vscode/

# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.so
*.pkl
*.pickle
.pytest_cache/
.mypy_cache/
.coverage
coverage.xml

# Environments
.env
.env.*
.venv/
venv/

# Project outputs (large or generated)
data/
!data/.gitkeep
models/
!models/.gitkeep
checkpoints/
runs/

# Sessions / secrets / sqlite
*.session
*.sqlite*
*.db
*.log

# Notebooks
.ipynb_checkpoints/

# Caches and locks
.cache/
*.lock

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.venv/
venv/
.env
*.env

# Telethon session files
*.session
*.session-journal

# Jupyter
.ipynb_checkpoints/

# macOS
.DS_Store
README.md (new file)
@@ -0,0 +1,205 @@
# Telegram analytics toolkit

Scrape public Telegram channel posts, fetch replies and forwards, and generate rich analytics reports with tagging, sentiment, matchday overlays, and plots. Use VADER, a local transformers model, or a local GPT (Ollama) backend for sentiment.

Highlights:
- Fast replies scraping with concurrency, resume/append, and rate-limit visibility
- Forwards scanning with chunked, concurrent search
- Analyzer: tagging from YAML keywords; sentiment via VADER, transformers, or local GPT; emoji-aware modes; combined posts+replies metrics; and matchday cross-analysis
- Plots: daily activity with in-plot match labels, daily volume vs sentiment (new), heatmaps, and per-tag (team) sentiment shares
- Local learning: fine-tune and evaluate a transformers classifier and use it in analysis

Full command reference is in `docs/COMMANDS.md`.

## Quick start

1) Configure secrets in `.env` (script will prompt if absent):

```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```

2) Run any command via the wrapper (creates venv and installs deps automatically):

```zsh
# Fetch messages to CSV
./run_scraper.sh scrape -c https://t.me/Premier_League_Update -o data/premier_league_update.csv --start-date 2025-08-15 --end-date 2025-10-15

# Fetch replies fast
./run_scraper.sh replies -c https://t.me/Premier_League_Update --from-csv data/premier_league_update.csv -o data/premier_league_replies.csv --min-replies 1 --concurrency 15 --resume --append

# Analyze with tags, fixtures, emoji handling and plots
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv --tags-config config/tags.yaml --write-augmented-csv --write-combined-csv --emoji-mode keep --emoji-boost --save-plots
```

3) Use transformers sentiment instead of VADER:

```zsh
# Off-the-shelf fine-tuned sentiment head
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

4) Use a local GPT backend (Ollama) for sentiment (JSON labels+confidence mapped to a compound score):

```zsh
# Ensure Ollama is running locally and the model is available (e.g., llama3)
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

## Aliases

Convenient zsh functions live in `scripts/aliases.zsh`:

- `fast_replies` — resume+append replies with concurrency
- `chunked_forwards` — concurrent forwards scan
- `analyze_combined` — posts+replies+fixtures with tags
- `analyze_emoji` — emoji-aware analyze with boost
- `analyze_transformers` — analyze with transformers and export details
- `apply_labels_and_analyze` — merge a labeled CSV into posts/replies and run analyzer (reuses sentiment_label)
- `plot_labeled` — QA plots from a labeled CSV (class distribution, confidence, lengths)
- `train_transformers` — fine-tune a model on a labeled CSV
- `eval_transformers` — evaluate a fine-tuned model

Source them:

```zsh
source scripts/aliases.zsh
```

## Local transformers (optional)

Train a classifier:

```zsh
./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
```

Evaluate it:

```zsh
./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
```

Use it in analyze:

```zsh
./run_scraper.sh analyze -i data/premier_league_update.csv --replies-csv data/premier_league_replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

Notes:
- GPU/Apple Silicon (MPS) is auto-detected; CPU is the fallback.
- Torch pinning in `requirements.txt` uses conditional versions for smooth installs across Python versions.

## Plots produced (when --save-plots is used)

- `daily_activity_stacked.png` — stacked bar chart of posts vs replies per day.
  - Dynamic sizing: `--plot-width-scale`, `--plot-max-width`, `--plot-height`
  - Top-N highlights: `--activity-top-n` (labels show total and posts+replies breakdown)
  - Match labels inside the plot using team abbreviations; control density with:
    - `--labels-max-per-day`, `--labels-per-line`, `--labels-stagger-rows`, `--labels-band-y`, `--labels-annotate-mode`
- `daily_volume_and_sentiment.png` — total volume (posts+replies) per day as bars (left Y) and positive%/negative% as lines (right Y). Uses `sentiment_label` when present, otherwise `sentiment_compound` thresholds.
- `posts_heatmap_hour_dow.png` — heatmap of posts activity by hour and day-of-week.
- `sentiment_by_tag_posts.png` — stacked shares of pos/neu/neg by team tag (tags starting with `club_`), with dynamic width.
- Matchday rollups (when fixtures are provided):
  - `matchday_sentiment_overall.csv` — per-fixture-day aggregates for posts (and replies when provided)
  - `matchday_sentiment_overall.png` — mean sentiment time series on matchdays (posts, replies)
  - `matchday_posts_volume_vs_sentiment.png` — scatter of posts volume vs mean sentiment on matchdays
- Diagnostics:
  - `match_labels_debug.csv` — per-day list of rendered match labels (helps tune label density)

Tip: The analyzer adapts plot width to the number of days; for very long ranges, raise `--plot-max-width`.

## Plot sizing and label flags (analyze)

- `--plot-width-scale` (default 0.8): inches per day for the daily charts width.
- `--plot-max-width` (default 104): cap on width in inches.
- `--plot-height` (default 6.5): figure height in inches.
- `--activity-top-n` (default 5): highlight top-N activity days; 0 disables.
- Match label controls:
  - `--labels-max-per-day` (default 3): cap labels per day (+N more).
  - `--labels-per-line` (default 2): labels per line in the band.
  - `--labels-band-y` (default 0.96): vertical position of the band (axes coords).
  - `--labels-stagger-rows` (default 2): stagger rows to reduce collisions.
  - `--labels-annotate-mode` (ticks|all|ticks+top): which x positions get labels.
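The `--labels-max-per-day` cap can be pictured with a small sketch (illustrative only; `cap_labels` is a hypothetical helper, not the analyzer's code):

```python
def cap_labels(labels, max_per_day=3):
    """Keep at most max_per_day match labels for a day, collapsing the
    rest into a '+N more' marker (sketch of --labels-max-per-day)."""
    if len(labels) <= max_per_day:
        return labels
    return labels[:max_per_day] + [f"+{len(labels) - max_per_day} more"]
```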

## Automatic labeling (no manual annotation)

If you don't want to label data by hand, generate a labeled training set automatically and train a local model.

Label with VADER (fast) or a pretrained transformers model (higher quality):

```zsh
# Load aliases
source scripts/aliases.zsh

# VADER: keeps only confident predictions by default
auto_label_vader

# Or Transformers: CardiffNLP 3-class sentiment (keeps confident only)
auto_label_transformers

# Output: data/labeled_sentiment.csv (message, label, confidence, ...)
```

Then fine-tune a classifier on the generated labels and use it in analysis:

```zsh
# Train on the auto-labeled CSV
train_transformers

# Analyze using your fine-tuned model
./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv --save-plots
```

Advanced knobs (optional):
- VADER thresholds: `--vader-pos 0.05 --vader-neg -0.05 --vader-margin 0.2`
- Transformers acceptance: `--min-prob 0.6 --min-margin 0.2`
- Keep all predictions (not just confident): remove `--only-confident`
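One plausible reading of how these VADER thresholds combine (a sketch under stated assumptions, not the project's exact acceptance logic):

```python
def vader_label(compound, pos=0.05, neg=-0.05, margin=0.2):
    """Label a VADER compound score; return None when the score is not past
    the threshold by at least `margin` (sketch of the --vader-pos,
    --vader-neg, and --vader-margin knobs; the actual logic may differ)."""
    if compound >= pos + margin:
        return "positive"
    if compound <= neg - margin:
        return "negative"
    if neg < compound < pos:
        return "neutral"
    return None  # not confident enough; dropped under --only-confident
```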

## Local GPT backend (Ollama)

You can use a local GPT model for sentiment. The analyzer requests strict JSON `{label, confidence}` and maps it to a compound score. If the GPT call fails for any rows, it gracefully falls back to VADER for those rows.
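Such a mapping from `{label, confidence}` to a compound score might look like this (an illustrative sketch; `to_compound` is hypothetical and the analyzer's actual mapping may differ):

```python
def to_compound(label: str, confidence: float) -> float:
    """Map a GPT sentiment label plus confidence to a VADER-style compound
    score in [-1, 1]. Neutral maps to 0 regardless of confidence; the
    confidence is clamped to [0, 1] before signing."""
    sign = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}[label.lower()]
    return sign * max(0.0, min(1.0, confidence))
```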

Example:

```zsh
./run_scraper.sh analyze -i data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --fixtures-csv data/premier_league_schedule_2025-08-15_to_2025-10-15.csv \
  --tags-config config/tags.yaml \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

## License

MIT (adjust as needed)
config/tags.yaml (new file)
@@ -0,0 +1,103 @@
# Keyword tag configuration
# Each tag has a list of case-insensitive substrings or regex patterns (prefix with re:)
# Messages matching ANY pattern for a tag are labeled with that tag.
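The matching semantics described in these comments can be sketched in Python (illustrative only; `match_tag` is a hypothetical helper, not the analyzer's code):

```python
import re

def match_tag(message: str, patterns: list[str]) -> bool:
    """Return True if the message matches ANY pattern: plain strings are
    case-insensitive substrings; patterns prefixed with 're:' are treated
    as regexes (sketch of the semantics described in the comments above)."""
    for p in patterns:
        if p.startswith("re:"):
            if re.search(p[3:], message, re.IGNORECASE):
                return True
        elif p.lower() in message.lower():
            return True
    return False
```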

score_update:
  - "FT"
  - "full time"
  - "final score"
  - "HT"
  - "half time"
  - "kick-off"
  - "kick off"

transfer:
  - "transfer"
  - "signs"
  - "signed"
  - "loan"
  - "contract"
  - "deal"

injury:
  - "injury"
  - "injured"
  - "out for"
  - "ruled out"

match_highlight:
  - "goal"
  - "scores"
  - "assist"
  - "penalty"
  - "VAR"
  - "red card"
  - "yellow card"

club_arsenal:
  - "Arsenal"
club_manchester_city:
  - "Manchester City"
club_manchester_united:
  - "Manchester United"
club_chelsea:
  - "Chelsea"
club_liverpool:
  - "Liverpool"
club_tottenham:
  - "Tottenham"
club_newcastle:
  - "Newcastle"
club_west_ham:
  - "West Ham"
club_brighton:
  - "Brighton"
club_aston_villa:
  - "Aston Villa"
club_everton:
  - "Everton"
club_crystal_palace:
  - "Crystal Palace"
  - "Palace"
club_bournemouth:
  - "Bournemouth"
  - "AFC Bournemouth"
club_brentford:
  - "Brentford"
club_fulham:
  - "Fulham"
club_nottingham_forest:
  - "Nottingham Forest"
  - "Forest"
club_wolves:
  - "Wolves"
  - "Wolverhampton"
club_burnley:
  - "Burnley"
club_southampton:
  - "Southampton"
  - "Saints"
club_leicester_city:
  - "Leicester"
  - "Leicester City"
club_leeds_united:
  - "Leeds"
  - "Leeds United"
club_sheffield_united:
  - "Sheffield United"
  - "Sheff Utd"
club_west_bromwich_albion:
  - "West Brom"
  - "West Bromwich"
club_ipswich_town:
  - "Ipswich"
  - "Ipswich Town"
club_portsmouth:
  - "Portsmouth"
  - "Pompey"
club_hull_city:
  - "Hull"
  - "Hull City"
club_middlesbrough:
  - "Middlesbrough"
  - "Boro"
docs/COMMANDS.md (new file)
@@ -0,0 +1,743 @@
# Project command reference

This file lists all supported commands and practical permutations for `./run_scraper.sh`, with short comments and tips. It mirrors the actual CLI flags in the code.

- Shell: zsh (macOS) — commands below are ready to paste.
- Env: A `.venv` is created automatically; dependencies are installed from `requirements.txt`.
- Secrets: Create `.env` with TELEGRAM_API_ID and TELEGRAM_API_HASH; for fixtures also set FOOTBALL_DATA_API_TOKEN.
- 2FA: If you use Telegram two-step verification, set TELEGRAM_2FA_PASSWORD in `.env` (the shell wrapper doesn't accept a flag for this).
- Sessions: Telethon uses a SQLite session file (default `telegram.session`). When running multiple tools in parallel, use distinct `--session-name` values.

## Common conventions

- Channels
  - Use either handle or URL: `-c @name` or `-c https://t.me/name`.
  - For replies, the channel must match the posts' source in your CSV `url` column.
- Output behavior
  - scrape/replies/forwards overwrite unless you pass `--append`.
  - analyze always overwrites its outputs.
- Rate-limits
  - Replies/forwards log `[rate-limit]` if Telegram asks you to wait. Reduce `--concurrency` if this happens frequently.
- Parallel runs
  - Add `--session-name <unique>` per process to avoid "database is locked". Prefer sessions outside iCloud Drive.

---

## Scrape (posts/messages)

Minimal (overwrite output):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv
```

With date range and limit:
```zsh
./run_scraper.sh scrape \
  -c https://t.me/SomeChannel \
  -o data/messages.jsonl \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --limit 500
```

Legacy offset date (deprecated; prefer --start-date):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --offset-date 2025-01-01
```

Append to existing file and pass phone on first login:
```zsh
./run_scraper.sh scrape \
  -c @SomeChannel \
  -o data/messages.csv \
  --append \
  --phone +15551234567
```

Use a custom session (useful in parallel):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --session-name telegram_scrape
```

Notes:
- Output format is inferred by extension: `.csv` or `.jsonl`/`.ndjson`.
- Two-step verification: set TELEGRAM_2FA_PASSWORD in `.env` (no CLI flag in the shell wrapper).

### All valid forms (scrape)

Use one of the following combinations. Replace placeholders with your values.

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv or .jsonl
  - Optional value flags: [--limit N] [--session-name NAME] [--phone NUMBER]

- Date filter permutations (4) × Append flag (2) × Limit presence (2) = 16 forms

1) No dates, no append, no limit
   ./run_scraper.sh scrape -c CH -o OUT
2) No dates, no append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --limit N
3) No dates, with append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --append
4) No dates, with append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --append --limit N
5) Start only, no append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD
6) Start only, no append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --limit N
7) Start only, with append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append
8) Start only, with append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append --limit N
9) End only, no append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD
10) End only, no append, with limit
    ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --limit N
11) End only, with append, no limit
    ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append
12) End only, with append, with limit
    ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append --limit N
13) Start and end, no append, no limit
    ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD
14) Start and end, no append, with limit
    ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --limit N
15) Start and end, with append, no limit
    ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append
16) Start and end, with append, with limit
    ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append --limit N

Optional add-ons valid for any form above:
- Append [--session-name NAME] and/or [--phone NUMBER]
- Deprecated alternative to start-date: add [--offset-date YYYY-MM-DD]

---

## Replies (fetch replies to posts)

From a posts CSV (fast path; skips posts with 0 replies in the CSV):
```zsh
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append
```

Using explicit message IDs:
```zsh
./run_scraper.sh replies \
  -c @SourceChannel \
  --ids "123,456,789" \
  -o data/replies.csv \
  --concurrency 5 \
  --append
```

IDs from a file (one per line) using zsh substitution:
```zsh
IDS=$(tr '\n' ',' < parent_ids.txt | sed 's/,$//')
./run_scraper.sh replies -c @SourceChannel --ids "$IDS" -o data/replies.csv --concurrency 8 --append
```

Parallel-safe session name:
```zsh
./run_scraper.sh replies -c @SourceChannel --from-csv data/messages.csv -o data/replies.csv --concurrency 12 --resume --append --session-name telegram_replies
```

What the flags do:
- `--from-csv PATH` reads parent IDs from a CSV with an `id` column (optionally filtered by `--min-replies`).
- `--ids` provides a comma-separated list of parent IDs.
- `--concurrency K` processes K parent IDs in parallel (default 5).
- `--resume` dedupes by `(parent_id,id)` pairs already present in the output.
- `--append` appends to output instead of overwriting.
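The `--resume` dedupe described above might be sketched like this (assumed column names `parent_id` and `id`; not the tool's actual implementation):

```python
import csv

def seen_pairs(path):
    """Collect (parent_id, id) pairs already present in an output CSV, so a
    resumed run can skip replies it has already written (sketch of the
    --resume behavior; the real tool's code may differ)."""
    try:
        with open(path, newline="") as f:
            return {(row["parent_id"], row["id"]) for row in csv.DictReader(f)}
    except FileNotFoundError:
        # No previous output yet: nothing to skip.
        return set()
```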

Notes:
- The channel (`-c`) must match the posts' source in your CSV URLs (the tool warns on mismatch).
- First login may require `--phone` (interactive prompt). For 2FA, set TELEGRAM_2FA_PASSWORD in `.env`.

### All valid forms (replies)

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
- Source: exactly one of S1 or S2
  - S1: --ids "id1,id2,..."
  - S2: --from-csv PATH [--min-replies N]
- Optional: [--concurrency K] [--session-name NAME] [--phone NUMBER]
- Binary: [--append], [--resume]

- Enumerated binary permutations for each source (4 per source = 8 total):

S1 + no append + no resume
   ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT
S1 + no append + resume
   ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --resume
S1 + append + no resume
   ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append
S1 + append + resume
   ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append --resume

S2 + no append + no resume
   ./run_scraper.sh replies -c CH --from-csv PATH -o OUT
S2 + no append + resume
   ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --resume
S2 + append + no resume
   ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append
S2 + append + resume
   ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append --resume

Optional add-ons valid for any form above:
- Add [--concurrency K] to tune speed; recommended 8–20
- With S2 you may add [--min-replies N] to prioritize parents with replies
- Add [--session-name NAME] and/or [--phone NUMBER]

---

## Forwards (same-channel forwards referencing posts)

Typical concurrent scan (best-effort; often zero results):
```zsh
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500
```

With date filters (applied to scanned messages):
```zsh
./run_scraper.sh forwards \
  -c @SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --scan-limit 10000 \
  --concurrency 8 \
  --chunk-size 1000
```

Using explicit message IDs:
```zsh
./run_scraper.sh forwards -c @SourceChannel --ids "100,200,300" -o data/forwards.csv --scan-limit 8000 --concurrency 6 --chunk-size 1000
```

Sequential mode (no chunking) by omitting --scan-limit:
```zsh
./run_scraper.sh forwards -c @SourceChannel --from-csv data/messages.csv -o data/forwards.csv
```

What the flags do:
- `--scan-limit N`: enables chunked, concurrent scanning of ~N recent message IDs.
- `--concurrency K`: number of id-chunks to scan in parallel (requires `--scan-limit`).
- `--chunk-size M`: approx. IDs per chunk (a trade-off between load balance and per-chunk overhead). Start with 1000–2000.
- `--append`: append instead of overwrite.
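Under the assumption that chunking simply partitions a recent ID range (a sketch; `id_chunks` is hypothetical, not the tool's code), `--scan-limit` and `--chunk-size` interact like this:

```python
def id_chunks(newest_id: int, scan_limit: int, chunk_size: int):
    """Split the most recent ~scan_limit message IDs into (lo, hi) ranges
    of about chunk_size IDs each, suitable for scanning concurrently."""
    lo = max(1, newest_id - scan_limit + 1)
    return [(start, min(start + chunk_size - 1, newest_id))
            for start in range(lo, newest_id + 1, chunk_size)]
```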

Notes:
- This only finds forwards within the same channel that reference your parent IDs (self-forwards). Many channels will yield zero.
- Global cross-channel forward discovery is not supported here (can be added as a separate mode).
- Without `--scan-limit`, the tool scans sequentially from newest backwards and logs progress every ~1000 messages.

### All valid forms (forwards)

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
- Source: exactly one of S1 or S2
  - S1: --ids "id1,id2,..."
  - S2: --from-csv PATH
- Modes:
  - M1: Sequential scan (omit --scan-limit)
  - M2: Chunked concurrent scan (requires --scan-limit N; accepts --concurrency K and --chunk-size M)
- Optional date filters for both modes: [--start-date D] [--end-date D]
- Binary: [--append]
- Optional: [--session-name NAME] [--phone NUMBER]

- Enumerated permutations by mode, source, and append (2 modes × 2 sources × 2 append = 8 forms):

M1 + S1 + no append
   ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT [--start-date D] [--end-date D]
M1 + S1 + append
   ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --append [--start-date D] [--end-date D]
M1 + S2 + no append
   ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT [--start-date D] [--end-date D]
M1 + S2 + append
   ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --append [--start-date D] [--end-date D]

M2 + S1 + no append
   ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S1 + append
   ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + no append
   ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + append
   ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]

Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]

---

## Analyze (reports and tagging)

Posts-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv
```
Outputs:
- `data/messages_report.md`
- `data/messages_tagged.csv`

Replies-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/replies.csv \
  --channel "Replies - @SourceChannel" \
  --tags-config config/tags.yaml \
  --write-augmented-csv
```
Outputs:
- `data/replies_report.md`
- `data/replies_tagged.csv`

Combined (posts report augmented with replies):
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --replies-csv data/replies.csv \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv \
  --write-combined-csv \
  --emoji-mode keep \
  --emoji-boost \
  --save-plots
```
Adds to posts dataset:
- `sentiment_compound` for posts (VADER)
- `replies_sentiment_mean` (avg reply sentiment per post)
- `replies_count_scraped` and `replies_top_tags` (rollup from replies)

Report sections include:
- Summary, top posts by views/forwards/replies
- Temporal distributions
- Per-tag engagement
- Per-tag sentiment (posts)
- Replies per-tag summary
- Per-tag sentiment (replies)
- Combined sentiment (posts + replies)
- Matchday cross-analysis (when `--fixtures-csv` is provided):
  - Posts: on vs off matchdays (counts and sentiment shares)
  - Posts engagement vs matchday (replies per post: total, mean, median, share of posts with replies)
  - Replies: on vs off matchdays (counts and sentiment shares)
  - Replies by parent matchday and by reply date are both shown; parent-based classification is recommended for engagement.
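The on/off matchday bucketing behind these rollups can be sketched as follows (illustrative only; `matchday_shares` is hypothetical, and the 0.05 positive threshold is an assumption, not the tool's value):

```python
def matchday_shares(rows, fixture_dates, pos_threshold=0.05):
    """Bucket (date, compound) rows into on/off matchday groups and report
    counts plus the share of positively scored items per group."""
    buckets = {"on": [], "off": []}
    for day, compound in rows:
        buckets["on" if day in fixture_dates else "off"].append(compound)
    return {
        key: {
            "count": len(vals),
            "pos_share": sum(1 for c in vals if c >= pos_threshold) / len(vals) if vals else 0.0,
        }
        for key, vals in buckets.items()
    }
```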

Notes:
- Analyze overwrites outputs; use `-o` to customize the report filename if needed.
- Emoji handling: add `--emoji-mode keep|demojize|strip` (default keep). Optionally `--emoji-boost` to gently tilt scores when clearly positive/negative emojis are present.
- Add `--write-combined-csv` to emit a unified CSV of posts+replies with a `content_type` column.
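One way `--emoji-boost` could gently tilt a score (an illustrative sketch; the emoji sets and the 0.1 delta are assumptions, not the tool's actual values):

```python
POS_EMOJIS = ["😂", "🔥", "👏", "❤️"]  # assumed positive set (illustrative)
NEG_EMOJIS = ["😡", "💔", "😭"]        # assumed negative set (illustrative)

def emoji_boost(compound, text, delta=0.1):
    """Nudge a sentiment score when clearly positive/negative emojis
    appear, clamping the result to [-1, 1]."""
    if any(e in text for e in POS_EMOJIS):
        compound = min(1.0, compound + delta)
    if any(e in text for e in NEG_EMOJIS):
        compound = max(-1.0, compound - delta)
    return compound
```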

### All valid forms (analyze)

- Base variables:
  - IN = input CSV (posts or replies)
  - Optional outputs/labels: `[-o REPORT.md] [--channel @handle]`
  - Optional configs/data: `[--tags-config config/tags.yaml] [--replies-csv REPLIES.csv] [--fixtures-csv FIXTURES.csv]`
  - Binary flag: `[--write-augmented-csv]`
- Core permutations across replies-csv, fixtures-csv, and write-augmented-csv (2×2×2 = 8 forms):

1) No replies, no fixtures, no augmented CSV: `./run_scraper.sh analyze -i IN`
2) No replies, no fixtures, with augmented CSV: `./run_scraper.sh analyze -i IN --write-augmented-csv`
3) No replies, with fixtures, no augmented CSV: `./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv`
4) No replies, with fixtures, with augmented CSV: `./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv --write-augmented-csv`
5) With replies, no fixtures, no augmented CSV: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv`
6) With replies, no fixtures, with augmented CSV: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --write-augmented-csv`
7) With replies, with fixtures, no augmented CSV: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv`
8) With replies, with fixtures, with augmented CSV: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv --write-augmented-csv`
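
The eight permutations above can be enumerated programmatically, e.g.:

```python
from itertools import product

base = "./run_scraper.sh analyze -i IN"
opts = ["--replies-csv REPLIES.csv", "--fixtures-csv FIXTURES.csv", "--write-augmented-csv"]

# Every on/off combination of the three optional flags: 2 x 2 x 2 = 8 command forms.
forms = []
for mask in product([False, True], repeat=len(opts)):
    parts = [base] + [opt for opt, on in zip(opts, mask) if on]
    forms.append(" ".join(parts))
```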

Optional add-ons valid for any form above:

- Append `[-o REPORT.md]` to control the output filename
- Append `[--channel @handle]` to set the report title
- Append `[--tags-config config/tags.yaml]` to enable tagging and per-tag summaries
- Append `[--emoji-mode keep|demojize|strip]` and optionally `[--emoji-boost]`
- Append `[--write-combined-csv]` to produce a merged posts + replies CSV
- Append `[--save-plots]` to emit plots to the data folder
- Append `[--sentiment-backend transformers]` and `[--transformers-model <name-or-path>]` to use a local HF model instead of VADER
- Append `[--export-transformers-details]` to include `sentiment_label` and `sentiment_probs` in augmented/combined CSVs
- Append `[--sentiment-backend gpt]` and optionally `[--gpt-model MODEL] [--gpt-base-url URL] [--gpt-batch-size K]` to use a local GPT (Ollama) backend
- Plot sizing and label controls (daily charts):
  - `[--plot-width-scale FLOAT] [--plot-max-width INCHES] [--plot-height INCHES]`
  - `[--activity-top-n N]`
  - `[--labels-max-per-day N] [--labels-per-line N] [--labels-band-y FLOAT] [--labels-stagger-rows N] [--labels-annotate-mode ticks|all|ticks+top]`

When fixtures are provided (`--fixtures-csv`):

- The report adds a "## Matchday cross-analysis" section with on vs off matchday tables.
- Plots include:
  - `daily_activity_stacked.png` with match labels inside the chart
  - `daily_volume_and_sentiment.png` (bars: volume; lines: pos%/neg%)
  - `matchday_sentiment_overall.png` (time series on fixture days)
  - `matchday_posts_volume_vs_sentiment.png` (scatter)
- The combined CSV (with `--write-combined-csv`) includes `is_matchday` and, for replies, `parent_is_matchday` when available.
- Replies are classified two ways: by reply date (`is_matchday` on the reply row) and by their parent post (`parent_is_matchday`). The latter better reflects matchday-driven engagement.
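
A minimal pandas sketch of the parent-based classification (assuming posts carry an `is_matchday` column and replies a `parent_id`, as in the CSVs above):

```python
import pandas as pd

posts = pd.DataFrame({"id": [10, 11], "is_matchday": [True, False]})
replies = pd.DataFrame({
    "parent_id": [10, 11, 11],
    "date": ["2025-08-16", "2025-08-17", "2025-08-18"],
})

# Classify each reply by its parent post's matchday flag, not its own date.
parent_flag = posts.set_index("id")["is_matchday"]
replies["parent_is_matchday"] = replies["parent_id"].map(parent_flag)
```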

Emoji and plots examples:

```zsh
# Keep emojis (default) and boost for strong positive/negative emojis
./run_scraper.sh analyze -i data/messages.csv --emoji-mode keep --emoji-boost --save-plots

# Demojize to :smiling_face: tokens (helps some tokenizers), with boost
./run_scraper.sh analyze -i data/messages.csv --emoji-mode demojize --emoji-boost

# Strip emojis entirely (if they add noise)
./run_scraper.sh analyze -i data/messages.csv --emoji-mode strip --save-plots

# Use a transformers model for sentiment (auto-downloads on first use unless a local path is provided).
# Tip: for an off-the-shelf sentiment head, try a fine-tuned model like SST-2:
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english
```

## Local GPT backend (Ollama)

Use a local GPT model that returns JSON `{label, confidence}` per message; the analyzer maps this to a compound score and falls back to VADER on errors.
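
One plausible mapping from that JSON payload to a compound score, as a sketch (the analyzer's actual mapping and fallback logic may differ):

```python
def gpt_to_compound(result: dict) -> float:
    """Map a {label, confidence} payload to a VADER-like compound in [-1, 1]."""
    label = str(result.get("label", "neu")).lower()
    conf = float(result.get("confidence", 0.0))
    conf = max(0.0, min(1.0, conf))  # clamp malformed confidences
    if label.startswith("pos"):
        return conf
    if label.startswith("neg"):
        return -conf
    return 0.0  # neutral or unrecognized labels
```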

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

---

## Train a local transformers sentiment model

Prepare a labeled CSV with at least two columns: `message` and `label` (e.g., neg/neu/pos or 0/1/2).

Don’t have one yet? Create a labeling set from your existing posts/replies:

```zsh
# Generate a CSV to annotate by hand (adds a blank 'label' column)
./.venv/bin/python -m src.make_labeling_set \
  --posts-csv data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --sample-size 1000 \
  -o data/labeled_sentiment.csv

# Or via alias (after sourcing scripts/aliases.zsh)
make_label_set "$POSTS_CSV" "$REPLIES_CSV" data/labeled_sentiment.csv 1000
```

Then fine-tune:

```zsh
# Ensure the venv exists (run any ./run_scraper.sh command once), then:
./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
```

Use it in analyze:

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert
```

Export details (labels, probabilities) into CSVs:

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv
```

Notes:

- The analyzer maps model class probabilities to a VADER-like compound score in [-1, 1] for compatibility with the rest of the report.
- If the model’s id2label includes 'neg'/'neu'/'pos' labels, the mapping is more accurate; otherwise it defaults to pos - neg.
- GPU/Apple Silicon (MPS) is used automatically if available.
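
The default pos - neg mapping from the notes above can be sketched as follows (illustrative; assumes class probabilities keyed by label name):

```python
def probs_to_compound(probs: dict) -> float:
    """Collapse {label: probability} into a single score in [-1, 1] via pos - neg."""
    keys = {k.lower(): v for k, v in probs.items()}
    pos = keys.get("pos", keys.get("positive", 0.0))
    neg = keys.get("neg", keys.get("negative", 0.0))
    # Neutral mass contributes nothing; pos and neg pull in opposite directions.
    return pos - neg
```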

Torch install note (macOS):

- `requirements.txt` uses conditional pins: `torch==2.3.1` for Python < 3.13 and `torch>=2.7.1` for Python ≥ 3.13. This keeps installs smooth on macOS. If you hit install issues, let us know.

## Evaluate a fine-tuned model

```zsh
./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
```

Prints accuracy, macro-precision/recall/F1, and a classification report.

## Fixtures (Premier League schedule via football-data.org)

Fetch fixtures between dates:

```zsh
./run_scraper.sh fixtures \
  --start-date 2025-08-15 \
  --end-date 2025-10-15 \
  -o data/fixtures.csv
```

Notes:

- Requires `FOOTBALL_DATA_API_TOKEN` in `.env`.
- Output may be `.csv` or `.json` (chosen by file extension).
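
Extension-based dispatch can be sketched like this (illustrative only; the real logic lives in the fixtures module):

```python
import csv
import json
import tempfile
from pathlib import Path

def write_fixtures(rows, out_path):
    """Write fixture rows as CSV or JSON depending on the output extension."""
    path = Path(out_path)
    if path.suffix == ".json":
        path.write_text(json.dumps(rows, indent=2))
    elif path.suffix == ".csv":
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported extension: {path.suffix!r}")
    return path

rows = [{"date": "2025-08-16", "home": "ARS", "away": "CHE"}]
out = write_fixtures(rows, Path(tempfile.mkdtemp()) / "fixtures.json")
```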

### All valid forms (fixtures)

- Base variables:
  - SD = start date YYYY-MM-DD
  - ED = end date YYYY-MM-DD
  - OUT = output `.csv` or `.json`

Form: `./run_scraper.sh fixtures --start-date SD --end-date ED -o OUT`

---
## Advanced recipes

Parallel replies + forwards with separate sessions:

```zsh
# Terminal 1 – replies
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_replies"

# Terminal 2 – forwards
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500 \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_forwards"
```

Tuning for rate limits:

- If `[rate-limit]` logs are frequent, reduce `--concurrency` (start by lowering it by 3–5) and keep `--chunk-size` around 1000–2000.
- For replies, prioritize with `--min-replies 1` to skip parents with zero replies.

Safety:

- For replies, use `--append` together with `--resume` to avoid truncating the file and to dedupe.
- Forwards and scrape don’t dedupe; prefer writing to a new file, or dedupe afterwards.
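
One way to dedupe afterwards, keeping the most recent row per message id (a pandas sketch; column names assumed from the scraper output):

```python
import pandas as pd

# Toy data standing in for a forwards/scrape CSV with a repeated message id.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "views": [10, 20, 25, 30],  # the later scan saw a fresher count for id 2
})

# keep="last" retains the most recently appended row for each id.
deduped = df.drop_duplicates(subset="id", keep="last").reset_index(drop=True)
```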

---

## Environment setup quick-start

Create `.env` (the script will prompt if it is missing):

```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional defaults
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```

The first run will prompt for your phone number and login code (and 2FA password if enabled).

---

## Troubleshooting

- Empty replies file
  - Ensure `-c` matches the channel in your posts CSV URLs.
  - Use `--append` so the file isn’t truncated before writing.
- “database is locked”
  - Use a unique `--session-name` per parallel process; store sessions outside iCloud Drive.
- Forwards empty
  - Same-channel forwards are rare. This tool only finds self-forwards (not cross-channel forwards).
- Analyze errors
  - Ensure CSVs have the expected columns. Posts: `id,date,message,...`; replies: `parent_id,id,date,message,...`.
- Exit code 1 when starting
  - Check the last log lines. Common causes: missing TELEGRAM_API_ID/HASH in `.env`, a channel handle that doesn’t match the CSV URLs, a session file locked by another process (use distinct `--session-name`), or a bad output path.

---

## Quick aliases for daily runs (zsh) ⚡

Paste this section into your current shell or your `~/.zshrc` to get convenient Make-like commands.

### Project defaults (edit as needed)

```zsh
# Channel and files
export CH="https://t.me/Premier_League_Update"
export POSTS_CSV="data/premier_league_update.csv"
export REPLIES_CSV="data/premier_league_replies.csv"
export FORWARDS_CSV="data/premier_league_forwards.csv"
export TAGS_CFG="config/tags.yaml"
export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"

# Sessions directory outside iCloud (avoids sqlite locks)
export SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"
```

### Aliases (zsh functions)

```zsh
# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
fast_replies() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  local sess="${5:-$SESSION_DIR/telegram_replies}"
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --min-replies 1 \
    --concurrency "$conc" \
    --resume \
    --append \
    --session-name "$sess"
}

# Chunked forwards: concurrent chunk scan with progress logs
chunked_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$FORWARDS_CSV}"
  local scan="${4:-20000}"
  local conc="${5:-10}"
  local chunk="${6:-1500}"
  local sess="${7:-$SESSION_DIR/telegram_forwards}"
  ./run_scraper.sh forwards \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --scan-limit "$scan" \
    --concurrency "$conc" \
    --chunk-size "$chunk" \
    --append \
    --session-name "$sess"
}

# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
analyze_combined() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv
}

# Emoji-aware analyze with sensible defaults (keep + boost)
analyze_emoji() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local mode="${6:-keep}"  # keep | demojize | strip
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --emoji-mode "$mode" \
    --emoji-boost
}

# One-shot daily pipeline: fast replies then combined analyze
run_daily() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}

# One-shot daily pipeline with forwards in parallel
run_daily_with_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local forwards="${4:-$FORWARDS_CSV}"
  local rep_conc="${5:-15}"
  local f_scan="${6:-20000}"
  local f_conc="${7:-10}"
  local f_chunk="${8:-1500}"
  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"

  # Launch replies and forwards in parallel with separate sessions
  local pid_r pid_f
  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!

  # Wait for completion, then analyze with emoji handling
  wait "$pid_r"
  wait "$pid_f"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}
```

### Usage

```zsh
# Use project defaults
fast_replies
chunked_forwards
analyze_combined

# Override on the fly (channel, files, or tuning)
fast_replies "https://t.me/AnotherChannel" data/other_posts.csv data/other_replies.csv 12
chunked_forwards "$CH" "$POSTS_CSV" data/alt_forwards.csv 30000 12 2000
analyze_combined data/other_posts.csv data/other_replies.csv "$TAGS_CFG" "$FIXTURES_CSV" "$CH"
```
50 docs/SESSION_HISTORY.md Normal file
@@ -0,0 +1,50 @@
# Session history (Oct 25, 2025)

This document captures the key decisions, features added, and workflows established in the current development session so that future runs have quick context.

## Highlights

- Added a new plot, `daily_volume_and_sentiment.png`, showing bars for total volume (posts + replies) and lines for positive% and negative% per day.
- Improved the daily activity chart with in-plot match labels (team abbreviations), density controls, and dynamic width/height.
- Implemented matchday sentiment rollups and plots: `matchday_sentiment_overall.csv/.png`, `matchday_posts_volume_vs_sentiment.png`.
- Integrated multiple sentiment backends:
  - VADER (default)
  - Transformers (local model at `models/sentiment-distilbert`)
  - Local GPT via Ollama (JSON `{label, confidence}` mapped to compound) with graceful fallback to VADER
- Labeled-data workflow:
  - `src/apply_labels.py` merges labels back into posts/replies as `sentiment_label`
  - The analyzer reuses `sentiment_label` when present
  - `src/plot_labeled.py` provides QA plots
- Convenience: created a `run_all` alias to run from scratch (scrape → replies → fixtures → analyze) non-interactively.

## Key files and outputs

- Code
  - `src/analyze_csv.py` — analyzer with plots and matchday integration (now with a module docstring)
  - `src/gpt_sentiment.py`, `src/transformer_sentiment.py`, `src/auto_label_sentiment.py`, `src/apply_labels.py`, `src/plot_labeled.py`
  - `scripts/aliases.zsh` — includes `run_all`, `apply_labels_and_analyze`, and more
- Outputs (examples)
  - `data/daily_activity_stacked.png`
  - `data/daily_volume_and_sentiment.png`
  - `data/posts_heatmap_hour_dow.png`
  - `data/sentiment_by_tag_posts.png`
  - `data/matchday_sentiment_overall.csv/.png`
  - `data/matchday_posts_volume_vs_sentiment.png`

## Important flags (analyze)

- Sizing: `--plot-width-scale`, `--plot-max-width`, `--plot-height`
- Labels: `--activity-top-n`, `--labels-max-per-day`, `--labels-per-line`, `--labels-stagger-rows`, `--labels-band-y`, `--labels-annotate-mode`
- Sentiment backends: `--sentiment-backend vader|transformers|gpt`, plus `--transformers-model` or `--gpt-model`/`--gpt-base-url`
- Emoji: `--emoji-mode keep|demojize|strip` and `--emoji-boost`

## Aliases summary

- `run_all [CH] [START] [END] [POSTS] [REPLIES] [FIXTURES] [TAGS] [SESS_SCRAPE] [SESS_REPLIES] [CONC] [BACKEND] [MODEL] [GPT_MODEL] [GPT_URL]`
  - Full pipeline, non-interactive; defaults set in `scripts/aliases.zsh`
- `apply_labels_and_analyze [LABELED_CSV] [POSTS_IN] [REPLIES_IN] [POSTS_OUT] [REPLIES_OUT]`
- `analyze_transformers`, `analyze_emoji`, `analyze_combined`, `fast_replies`, `chunked_forwards`, `plot_labeled`

## Old vs new outputs

- We maintain side-by-side outputs under `data/old` and `data/new` when running the legacy vs labeled pipelines.

## Next ideas

- Per-club matchday sentiment breakdowns (fixture-level small multiples)
- Side-by-side montage generation for old vs new plots
19 requirements.txt Normal file
@@ -0,0 +1,19 @@
numpy
pandas
scikit-learn
matplotlib
seaborn
jupyter
telethon
python-dotenv
tabulate
requests
pyyaml
vaderSentiment
emoji>=2.8.0
transformers>=4.44.0
# Torch pinning: 2.3.1 on Python <3.13 (known-good on macOS); a compatible newer torch on Python >=3.13
torch==2.3.1; python_version < "3.13"
torch>=2.7.1; python_version >= "3.13"
datasets>=2.20.0
accelerate>=0.26.0
278 run_scraper.sh Executable file
@@ -0,0 +1,278 @@
#!/usr/bin/env zsh

# A convenience script to set up the venv, install deps, create/load .env, and run the tools:
# - Telegram scraper: scrape | replies | forwards
# - Analyzer: analyze (report + sentiment + tags)
# - Fixtures: fixtures (Premier League schedule)
set -euo pipefail

# Change to the script directory (handles spaces in the path)
cd "${0:A:h}"

PROJECT_ROOT=$(pwd)
PYTHON="${PROJECT_ROOT}/.venv/bin/python"
PIP="${PROJECT_ROOT}/.venv/bin/pip"
REQUIREMENTS_FILE="${PROJECT_ROOT}/requirements.txt"
SCRAPER_MODULE="src.telegram_scraper"
ANALYZE_MODULE="src.analyze_csv"
FIXTURES_MODULE="src.fetch_schedule"

usage() {
  cat <<'EOF'
Usage:
  ./run_scraper.sh scrape -c <channel> -o <output> [--limit N] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--phone <number>] [--append]
  ./run_scraper.sh replies -c <channel> (--ids "1,2,3" | --from-csv <path>) -o <output_csv> [--append] [--min-replies N] [--concurrency K] [--resume]
  ./run_scraper.sh forwards -c <channel> (--ids "1,2,3" | --from-csv <path>) -o <output_csv> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--scan-limit N] [--append] [--concurrency K] [--chunk-size M]
  ./run_scraper.sh analyze -i <input_csv> [-o <report_md>] [--channel @handle] [--tags-config config/tags.yaml] [--replies-csv <csv>] [--fixtures-csv <csv>] [--write-augmented-csv] [--write-combined-csv] [--emoji-mode keep|demojize|strip] [--emoji-boost] [--save-plots] [--sentiment-backend vader|transformers] [--transformers-model <hf_or_path>] [--export-transformers-details]
    [--plot-width-scale <float>] [--plot-max-width <inches>] [--plot-height <inches>] [--activity-top-n <int>]
    [--labels-max-per-day <int>] [--labels-per-line <int>] [--labels-band-y <float>] [--labels-stagger-rows <int>] [--labels-annotate-mode ticks|all|ticks+top]
    [--sentiment-backend gpt] [--gpt-model <name>] [--gpt-base-url <http://localhost:11434>] [--gpt-batch-size <int>]
  ./run_scraper.sh fixtures --start-date YYYY-MM-DD --end-date YYYY-MM-DD -o <output.{csv|json}>

Examples:
  ./run_scraper.sh scrape -c @python -o data.jsonl --limit 200
  ./run_scraper.sh scrape -c https://t.me/python -o data.csv --start-date 2025-01-01 --end-date 2025-03-31
  ./run_scraper.sh replies -c @python --from-csv data/messages.csv -o data/replies.csv
  ./run_scraper.sh forwards -c @python --from-csv data/messages.csv -o data/forwards.csv --start-date 2025-01-01 --end-date 2025-03-31 --scan-limit 20000
  ./run_scraper.sh analyze -i data/messages.csv --channel @python --tags-config config/tags.yaml --replies-csv data/replies.csv --fixtures-csv data/fixtures.csv --write-augmented-csv
  ./run_scraper.sh analyze -i data/messages.csv --sentiment-backend transformers --transformers-model distilbert-base-uncased --export-transformers-details --write-augmented-csv --write-combined-csv
  ./run_scraper.sh fixtures --start-date 2025-08-15 --end-date 2025-10-15 -o data/pl_fixtures.csv

Notes:
  - If .env is missing, you'll be prompted to create it when needed (Telegram or fixtures commands).
  - The first Telegram login will prompt for your phone number, a login code, and optionally a 2FA password.
EOF
}

# Subcommand parsing
if [[ $# -lt 1 ]]; then
  usage; exit 1
fi
COMMAND="$1"; shift || true

# Common and per-command args
CHANNEL=""; OUTPUT=""; LIMIT=""; OFFSET_DATE=""; PHONE=""; START_DATE=""; END_DATE=""; APPEND=false; SESSION_NAME=""
IDS=""; FROM_CSV=""; SCAN_LIMIT=""; MIN_REPLIES=""; CONCURRENCY=""; CHUNK_SIZE=""; RESUME=false
INPUT_CSV=""; REPORT_OUT=""; CHANNEL_NAME=""; TAGS_CONFIG=""; REPLIES_CSV=""; FIXTURES_CSV=""; WRITE_AUG=false; WRITE_COMBINED=false; EMOJI_MODE=""; EMOJI_BOOST=false; SAVE_PLOTS=false; SENTIMENT_BACKEND=""; TRANSFORMERS_MODEL=""; EXPORT_TRANSFORMERS_DETAILS=false; PLOT_WIDTH_SCALE=""; PLOT_MAX_WIDTH=""; PLOT_HEIGHT=""; ACTIVITY_TOP_N=""; LABELS_MAX_PER_DAY=""; LABELS_PER_LINE=""; LABELS_BAND_Y=""; LABELS_STAGGER_ROWS=""; LABELS_ANNOTATE_MODE=""; GPT_MODEL=""; GPT_BASE_URL=""; GPT_BATCH_SIZE=""

case "$COMMAND" in
  scrape|replies|forwards)
    while [[ $# -gt 0 ]]; do
      case "$1" in
        -c|--channel) CHANNEL="$2"; shift 2;;
        -o|--output) OUTPUT="$2"; shift 2;;
        --session-name) SESSION_NAME="$2"; shift 2;;
        --limit) LIMIT="$2"; shift 2;;
        --offset-date) OFFSET_DATE="$2"; shift 2;;
        --start-date) START_DATE="$2"; shift 2;;
        --end-date) END_DATE="$2"; shift 2;;
        --scan-limit) SCAN_LIMIT="$2"; shift 2;;
        --ids) IDS="$2"; shift 2;;
        --from-csv) FROM_CSV="$2"; shift 2;;
        --phone) PHONE="$2"; shift 2;;
        --append) APPEND=true; shift;;
        --min-replies) MIN_REPLIES="$2"; shift 2;;
        --concurrency) CONCURRENCY="$2"; shift 2;;
        --chunk-size) CHUNK_SIZE="$2"; shift 2;;
        --resume) RESUME=true; shift;;
        -h|--help) usage; exit 0;;
        *) echo "Unknown arg: $1"; usage; exit 1;;
      esac
    done
    ;;
  analyze)
    while [[ $# -gt 0 ]]; do
      case "$1" in
        -i|--input) INPUT_CSV="$2"; shift 2;;
        -o|--output) REPORT_OUT="$2"; shift 2;;
        --channel) CHANNEL_NAME="$2"; shift 2;;
        --tags-config) TAGS_CONFIG="$2"; shift 2;;
        --replies-csv) REPLIES_CSV="$2"; shift 2;;
        --fixtures-csv) FIXTURES_CSV="$2"; shift 2;;
        --write-augmented-csv) WRITE_AUG=true; shift;;
        --write-combined-csv) WRITE_COMBINED=true; shift;;
        --emoji-mode) EMOJI_MODE="$2"; shift 2;;
        --emoji-boost) EMOJI_BOOST=true; shift;;
        --save-plots) SAVE_PLOTS=true; shift;;
        --sentiment-backend) SENTIMENT_BACKEND="$2"; shift 2;;
        --transformers-model) TRANSFORMERS_MODEL="$2"; shift 2;;
        --export-transformers-details) EXPORT_TRANSFORMERS_DETAILS=true; shift;;
        --gpt-model) GPT_MODEL="$2"; shift 2;;
        --gpt-base-url) GPT_BASE_URL="$2"; shift 2;;
        --gpt-batch-size) GPT_BATCH_SIZE="$2"; shift 2;;
        --plot-width-scale) PLOT_WIDTH_SCALE="$2"; shift 2;;
        --plot-max-width) PLOT_MAX_WIDTH="$2"; shift 2;;
        --plot-height) PLOT_HEIGHT="$2"; shift 2;;
        --activity-top-n) ACTIVITY_TOP_N="$2"; shift 2;;
        --labels-max-per-day) LABELS_MAX_PER_DAY="$2"; shift 2;;
        --labels-per-line) LABELS_PER_LINE="$2"; shift 2;;
        --labels-band-y) LABELS_BAND_Y="$2"; shift 2;;
        --labels-stagger-rows) LABELS_STAGGER_ROWS="$2"; shift 2;;
        --labels-annotate-mode) LABELS_ANNOTATE_MODE="$2"; shift 2;;
        -h|--help) usage; exit 0;;
        *) echo "Unknown arg: $1"; usage; exit 1;;
      esac
    done
    # Defaults: use the local fine-tuned transformers model unless a backend/model was specified
    if [[ -z "$SENTIMENT_BACKEND" ]]; then SENTIMENT_BACKEND="transformers"; fi
    if [[ -z "$TRANSFORMERS_MODEL" ]]; then TRANSFORMERS_MODEL="models/sentiment-distilbert"; fi
    ;;
  fixtures)
    while [[ $# -gt 0 ]]; do
      case "$1" in
        --start-date) START_DATE="$2"; shift 2;;
        --end-date) END_DATE="$2"; shift 2;;
        -o|--output) OUTPUT="$2"; shift 2;;
        -h|--help) usage; exit 0;;
        *) echo "Unknown arg: $1"; usage; exit 1;;
      esac
    done
    ;;
  -h|--help)
    usage; exit 0;;
  *)
    echo "Unknown command: $COMMAND"; usage; exit 1;;
esac

# Required args validation
if [[ "$COMMAND" == "scrape" ]]; then
  if [[ -z "$CHANNEL" || -z "$OUTPUT" ]]; then echo "Error: scrape needs --channel and --output"; usage; exit 1; fi
elif [[ "$COMMAND" == "replies" || "$COMMAND" == "forwards" ]]; then
  if [[ -z "$CHANNEL" || -z "$OUTPUT" ]]; then echo "Error: $COMMAND needs --channel and --output"; usage; exit 1; fi
  if [[ -z "$IDS" && -z "$FROM_CSV" ]]; then echo "Error: $COMMAND needs --ids or --from-csv"; usage; exit 1; fi
elif [[ "$COMMAND" == "analyze" ]]; then
  if [[ -z "$INPUT_CSV" ]]; then echo "Error: analyze needs --input"; usage; exit 1; fi
elif [[ "$COMMAND" == "fixtures" ]]; then
  if [[ -z "$START_DATE" || -z "$END_DATE" || -z "$OUTPUT" ]]; then echo "Error: fixtures needs --start-date, --end-date, and --output"; usage; exit 1; fi
fi

echo "[1/4] Ensuring virtual environment..."
if [[ ! -x "$PYTHON" ]]; then
  echo "Creating virtual environment at .venv"
  python3 -m venv .venv
fi

echo "Activating virtual environment"
source .venv/bin/activate

echo "[2/4] Installing dependencies"
"$PIP" install -q --upgrade pip
"$PIP" install -q -r "$REQUIREMENTS_FILE"

echo "[3/4] Environment setup"
NEEDS_TELEGRAM=false
NEEDS_FIXTURES_TOKEN=false
if [[ "$COMMAND" == "scrape" || "$COMMAND" == "replies" || "$COMMAND" == "forwards" ]]; then NEEDS_TELEGRAM=true; fi
if [[ "$COMMAND" == "fixtures" ]]; then NEEDS_FIXTURES_TOKEN=true; fi

if [[ "$NEEDS_TELEGRAM" == true || "$NEEDS_FIXTURES_TOKEN" == true ]]; then
  if [[ ! -f .env ]]; then
    echo ".env not found. Let's create one now."
    if [[ "$NEEDS_TELEGRAM" == true ]]; then
      print -n "Enter TELEGRAM_API_ID (from my.telegram.org): "
      read -r TELEGRAM_API_ID
      print -n "Enter TELEGRAM_API_HASH (from my.telegram.org): "
      read -r TELEGRAM_API_HASH
      : ${TELEGRAM_SESSION_NAME:=telegram}
    fi
    cat > .env <<EOF
${TELEGRAM_API_ID:+TELEGRAM_API_ID=${TELEGRAM_API_ID}}
${TELEGRAM_API_HASH:+TELEGRAM_API_HASH=${TELEGRAM_API_HASH}}
${TELEGRAM_SESSION_NAME:+TELEGRAM_SESSION_NAME=${TELEGRAM_SESSION_NAME}}
${FOOTBALL_DATA_API_TOKEN:+FOOTBALL_DATA_API_TOKEN=${FOOTBALL_DATA_API_TOKEN}}
EOF
    echo "Created .env"
  fi

  echo "Loading environment from .env"
  set -a
  source .env
  set +a

  if [[ "$NEEDS_TELEGRAM" == true ]]; then
    if [[ -z "${TELEGRAM_API_ID:-}" || -z "${TELEGRAM_API_HASH:-}" ]]; then
      echo "Error: TELEGRAM_API_ID and TELEGRAM_API_HASH must be set in .env"
      exit 1
    fi
  fi
  if [[ "$NEEDS_FIXTURES_TOKEN" == true ]]; then
    if [[ -z "${FOOTBALL_DATA_API_TOKEN:-}" ]]; then
      echo "Error: FOOTBALL_DATA_API_TOKEN must be set in .env for fixtures"
      exit 1
    fi
  fi
fi

echo "[4/4] Running $COMMAND"
PY_ARGS=()
case "$COMMAND" in
  scrape)
    PY_ARGS=( -m "$SCRAPER_MODULE" scrape "$CHANNEL" --output "$OUTPUT" )
    if [[ -n "$SESSION_NAME" ]]; then PY_ARGS+=( --session-name "$SESSION_NAME" ); fi
    if [[ -n "$LIMIT" ]]; then PY_ARGS+=( --limit "$LIMIT" ); fi
    if [[ -n "$OFFSET_DATE" ]]; then PY_ARGS+=( --offset-date "$OFFSET_DATE" ); fi
    if [[ -n "$START_DATE" ]]; then PY_ARGS+=( --start-date "$START_DATE" ); fi
    if [[ -n "$END_DATE" ]]; then PY_ARGS+=( --end-date "$END_DATE" ); fi
    if [[ -n "$PHONE" ]]; then PY_ARGS+=( --phone "$PHONE" ); fi
    if [[ "$APPEND" == true ]]; then PY_ARGS+=( --append ); fi
|
||||||
|
;;
|
||||||
|
replies)
|
||||||
|
PY_ARGS=( -m "$SCRAPER_MODULE" replies "$CHANNEL" --output "$OUTPUT" )
|
||||||
|
if [[ -n "$SESSION_NAME" ]]; then PY_ARGS+=( --session-name "$SESSION_NAME" ); fi
|
||||||
|
if [[ -n "$IDS" ]]; then PY_ARGS+=( --ids "$IDS" ); fi
|
||||||
|
if [[ -n "$FROM_CSV" ]]; then PY_ARGS+=( --from-csv "$FROM_CSV" ); fi
|
||||||
|
if [[ -n "$PHONE" ]]; then PY_ARGS+=( --phone "$PHONE" ); fi
|
||||||
|
if [[ "$APPEND" == true ]]; then PY_ARGS+=( --append ); fi
|
||||||
|
if [[ -n "${MIN_REPLIES:-}" ]]; then PY_ARGS+=( --min-replies "$MIN_REPLIES" ); fi
|
||||||
|
if [[ -n "${CONCURRENCY:-}" ]]; then PY_ARGS+=( --concurrency "$CONCURRENCY" ); fi
|
||||||
|
if [[ "${RESUME:-false}" == true ]]; then PY_ARGS+=( --resume ); fi
|
||||||
|
;;
|
||||||
|
forwards)
|
||||||
|
PY_ARGS=( -m "$SCRAPER_MODULE" forwards "$CHANNEL" --output "$OUTPUT" )
|
||||||
|
if [[ -n "$SESSION_NAME" ]]; then PY_ARGS+=( --session-name "$SESSION_NAME" ); fi
|
||||||
|
if [[ -n "$IDS" ]]; then PY_ARGS+=( --ids "$IDS" ); fi
|
||||||
|
if [[ -n "$FROM_CSV" ]]; then PY_ARGS+=( --from-csv "$FROM_CSV" ); fi
|
||||||
|
if [[ -n "$START_DATE" ]]; then PY_ARGS+=( --start-date "$START_DATE" ); fi
|
||||||
|
if [[ -n "$END_DATE" ]]; then PY_ARGS+=( --end-date "$END_DATE" ); fi
|
||||||
|
if [[ -n "$SCAN_LIMIT" ]]; then PY_ARGS+=( --scan-limit "$SCAN_LIMIT" ); fi
|
||||||
|
if [[ -n "${CONCURRENCY:-}" ]]; then PY_ARGS+=( --concurrency "$CONCURRENCY" ); fi
|
||||||
|
if [[ -n "${CHUNK_SIZE:-}" ]]; then PY_ARGS+=( --chunk-size "$CHUNK_SIZE" ); fi
|
||||||
|
if [[ -n "$PHONE" ]]; then PY_ARGS+=( --phone "$PHONE" ); fi
|
||||||
|
if [[ "$APPEND" == true ]]; then PY_ARGS+=( --append ); fi
|
||||||
|
;;
|
||||||
|
analyze)
|
||||||
|
PY_ARGS=( -m "$ANALYZE_MODULE" "$INPUT_CSV" )
|
||||||
|
if [[ -n "$REPORT_OUT" ]]; then PY_ARGS+=( -o "$REPORT_OUT" ); fi
|
||||||
|
if [[ -n "$CHANNEL_NAME" ]]; then PY_ARGS+=( --channel "$CHANNEL_NAME" ); fi
|
||||||
|
if [[ -n "$TAGS_CONFIG" ]]; then PY_ARGS+=( --tags-config "$TAGS_CONFIG" ); fi
|
||||||
|
if [[ -n "$REPLIES_CSV" ]]; then PY_ARGS+=( --replies-csv "$REPLIES_CSV" ); fi
|
||||||
|
if [[ -n "$FIXTURES_CSV" ]]; then PY_ARGS+=( --fixtures-csv "$FIXTURES_CSV" ); fi
|
||||||
|
if [[ "$WRITE_AUG" == true ]]; then PY_ARGS+=( --write-augmented-csv ); fi
|
||||||
|
if [[ "$WRITE_COMBINED" == true ]]; then PY_ARGS+=( --write-combined-csv ); fi
|
||||||
|
if [[ -n "$EMOJI_MODE" ]]; then PY_ARGS+=( --emoji-mode "$EMOJI_MODE" ); fi
|
||||||
|
if [[ "${EMOJI_BOOST:-false}" == true ]]; then PY_ARGS+=( --emoji-boost ); fi
|
||||||
|
if [[ "${SAVE_PLOTS:-false}" == true ]]; then PY_ARGS+=( --save-plots ); fi
|
||||||
|
if [[ -n "$SENTIMENT_BACKEND" ]]; then PY_ARGS+=( --sentiment-backend "$SENTIMENT_BACKEND" ); fi
|
||||||
|
if [[ -n "$TRANSFORMERS_MODEL" ]]; then PY_ARGS+=( --transformers-model "$TRANSFORMERS_MODEL" ); fi
|
||||||
|
if [[ "${EXPORT_TRANSFORMERS_DETAILS:-false}" == true ]]; then PY_ARGS+=( --export-transformers-details ); fi
|
||||||
|
if [[ -n "$GPT_MODEL" ]]; then PY_ARGS+=( --gpt-model "$GPT_MODEL" ); fi
|
||||||
|
if [[ -n "$GPT_BASE_URL" ]]; then PY_ARGS+=( --gpt-base-url "$GPT_BASE_URL" ); fi
|
||||||
|
if [[ -n "$GPT_BATCH_SIZE" ]]; then PY_ARGS+=( --gpt-batch-size "$GPT_BATCH_SIZE" ); fi
|
||||||
|
if [[ -n "$PLOT_WIDTH_SCALE" ]]; then PY_ARGS+=( --plot-width-scale "$PLOT_WIDTH_SCALE" ); fi
|
||||||
|
if [[ -n "$PLOT_MAX_WIDTH" ]]; then PY_ARGS+=( --plot-max-width "$PLOT_MAX_WIDTH" ); fi
|
||||||
|
if [[ -n "$PLOT_HEIGHT" ]]; then PY_ARGS+=( --plot-height "$PLOT_HEIGHT" ); fi
|
||||||
|
if [[ -n "$ACTIVITY_TOP_N" ]]; then PY_ARGS+=( --activity-top-n "$ACTIVITY_TOP_N" ); fi
|
||||||
|
if [[ -n "$LABELS_MAX_PER_DAY" ]]; then PY_ARGS+=( --labels-max-per-day "$LABELS_MAX_PER_DAY" ); fi
|
||||||
|
if [[ -n "$LABELS_PER_LINE" ]]; then PY_ARGS+=( --labels-per-line "$LABELS_PER_LINE" ); fi
|
||||||
|
if [[ -n "$LABELS_BAND_Y" ]]; then PY_ARGS+=( --labels-band-y "$LABELS_BAND_Y" ); fi
|
||||||
|
if [[ -n "$LABELS_STAGGER_ROWS" ]]; then PY_ARGS+=( --labels-stagger-rows "$LABELS_STAGGER_ROWS" ); fi
|
||||||
|
if [[ -n "$LABELS_ANNOTATE_MODE" ]]; then PY_ARGS+=( --labels-annotate-mode "$LABELS_ANNOTATE_MODE" ); fi
|
||||||
|
;;
|
||||||
|
fixtures)
|
||||||
|
PY_ARGS=( -m "$FIXTURES_MODULE" --start-date "$START_DATE" --end-date "$END_DATE" -o "$OUTPUT" )
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
echo "Command: $PYTHON ${PY_ARGS[*]}"
"$PYTHON" "${PY_ARGS[@]}"
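The dispatcher above builds the final Python invocation by appending to an array only for flags that were actually set, then expanding the array quoted so values with spaces survive intact. A minimal self-contained sketch of that pattern (the subcommand, URL, and flag values here are hypothetical, not taken from the script):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; an empty variable means "flag not requested".
LIMIT="100"
OFFSET_DATE=""

# Start from the required arguments, then append optional flags conditionally.
ARGS=( scrape "https://t.me/example" --output out.csv )
if [[ -n "$LIMIT" ]]; then ARGS+=( --limit "$LIMIT" ); fi
if [[ -n "$OFFSET_DATE" ]]; then ARGS+=( --offset-date "$OFFSET_DATE" ); fi

# Expanding with "${ARGS[@]}" preserves each element as exactly one word.
printf '%s\n' "${ARGS[@]}"
```

Because `--offset-date` is skipped when its variable is empty, the command never receives a dangling flag with a blank value.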
338
scripts/aliases.zsh
Normal file
@@ -0,0 +1,338 @@
# Convenience aliases for daily runs (zsh)
# Source this file in your shell: source scripts/aliases.zsh

# --- Project defaults (edit as needed) ---
# Channel and files
export CH="https://t.me/Premier_League_Update"
export POSTS_CSV="data/premier_league_update.csv"
export REPLIES_CSV="data/premier_league_replies.csv"
export FORWARDS_CSV="data/premier_league_forwards.csv"
export TAGS_CFG="config/tags.yaml"
export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"
# Default fixtures date range (used by run_all)
export FIXTURES_START_DATE="2025-08-15"
export FIXTURES_END_DATE="2025-10-15"

# Sessions directory outside iCloud (avoid sqlite locks)
export SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"

# --- Aliases (zsh functions) ---

# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
fast_replies() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  local sess="${5:-$SESSION_DIR/telegram_replies}"
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --min-replies 1 \
    --concurrency "$conc" \
    --resume \
    --append \
    --session-name "$sess"
}

# Chunked forwards: concurrent chunk scan with progress logs
chunked_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$FORWARDS_CSV}"
  local scan="${4:-20000}"
  local conc="${5:-10}"
  local chunk="${6:-1500}"
  local sess="${7:-$SESSION_DIR/telegram_forwards}"
  ./run_scraper.sh forwards \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --scan-limit "$scan" \
    --concurrency "$conc" \
    --chunk-size "$chunk" \
    --append \
    --session-name "$sess"
}

# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
analyze_combined() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --save-plots
  # Tip: add plot sizing/labels, e.g.: --plot-width-scale 0.8 --plot-max-width 120 --plot-height 8 --activity-top-n 8 --labels-stagger-rows 3
}

# Emoji-aware analyze with sensible defaults (keep + boost)
analyze_emoji() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local mode="${6:-keep}"  # keep | demojize | strip
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --save-plots \
    --emoji-mode "$mode" \
    --emoji-boost
}

# Analyze with transformers (and export labels/probs)
analyze_transformers() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local model="${6:-distilbert-base-uncased}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --sentiment-backend transformers \
    --transformers-model "$model" \
    --export-transformers-details \
    --write-augmented-csv \
    --write-combined-csv \
    --save-plots
}

# Plot graphs from labeled sentiment CSV
plot_labeled() {
  local labeled_csv="${1:-data/labeled_sentiment.csv}"
  local out_dir="${2:-data}"
  ./.venv/bin/python -m src.plot_labeled \
    --input "$labeled_csv" \
    --out-dir "$out_dir"
}

# Merge labeled CSV back into posts/replies to reuse analyzer plots
apply_labels_and_analyze() {
  local labeled_csv="${1:-data/labeled_sentiment.csv}"
  local posts_in="${2:-$POSTS_CSV}"
  local replies_in="${3:-$REPLIES_CSV}"
  local posts_out="${4:-data/premier_league_update_with_labels.csv}"
  local replies_out="${5:-data/premier_league_replies_with_labels.csv}"
  ./.venv/bin/python -m src.apply_labels \
    --labeled-csv "$labeled_csv" \
    --posts-csv "$posts_in" \
    --replies-csv "$replies_in" \
    --posts-out "$posts_out" \
    --replies-out "$replies_out"
  # Reuse analyzer with the merged CSVs; it will pick up sentiment_label if present
  ./run_scraper.sh analyze \
    -i "$posts_out" \
    --replies-csv "$replies_out" \
    --fixtures-csv "$FIXTURES_CSV" \
    --tags-config "$TAGS_CFG" \
    --write-augmented-csv \
    --write-combined-csv \
    --save-plots
}

# Auto-label sentiment without manual annotation (VADER backend)
auto_label_vader() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local out="${3:-data/labeled_sentiment.csv}"
  ./.venv/bin/python -m src.auto_label_sentiment \
    --posts-csv "$posts" \
    --replies-csv "$replies" \
    --backend vader \
    --vader-pos 0.05 \
    --vader-neg -0.05 \
    --vader-margin 0.20 \
    --only-confident \
    -o "$out"
}

# Auto-label sentiment using a pretrained transformers model
auto_label_transformers() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local model="${3:-cardiffnlp/twitter-roberta-base-sentiment-latest}"
  local out="${4:-data/labeled_sentiment.csv}"
  ./.venv/bin/python -m src.auto_label_sentiment \
    --posts-csv "$posts" \
    --replies-csv "$replies" \
    --backend transformers \
    --transformers-model "$model" \
    --min-prob 0.6 \
    --min-margin 0.2 \
    --only-confident \
    -o "$out"
}

# Train a transformers model with the project venv
train_transformers() {
  local train_csv="${1:-data/labeled_sentiment.csv}"
  local text_col="${2:-message}"
  local label_col="${3:-label}"
  local base_model="${4:-distilbert-base-uncased}"
  local out_dir="${5:-models/sentiment-distilbert}"
  ./.venv/bin/python -m src.train_sentiment \
    --train-csv "$train_csv" \
    --text-col "$text_col" \
    --label-col "$label_col" \
    --model-name "$base_model" \
    --output-dir "$out_dir" \
    --epochs 3 \
    --batch-size 16
}

# Evaluate a fine-tuned transformers model
eval_transformers() {
  local csv="${1:-data/labeled_holdout.csv}"
  local text_col="${2:-message}"
  local label_col="${3:-label}"
  local model_dir="${4:-models/sentiment-distilbert}"
  ./.venv/bin/python -m src.eval_sentiment \
    --csv "$csv" \
    --text-col "$text_col" \
    --label-col "$label_col" \
    --model "$model_dir"
}

# Build a labeling CSV from existing posts+replies
make_label_set() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local out="${3:-data/labeled_sentiment.csv}"
  local n="${4:-1000}"
  ./.venv/bin/python -m src.make_labeling_set \
    --posts-csv "$posts" \
    --replies-csv "$replies" \
    --sample-size "$n" \
    -o "$out"
}

# One-shot daily pipeline: fast replies then combined analyze
run_daily() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}

# One-shot daily pipeline with forwards in parallel
run_daily_with_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local forwards="${4:-$FORWARDS_CSV}"
  local rep_conc="${5:-15}"
  local f_scan="${6:-20000}"
  local f_conc="${7:-10}"
  local f_chunk="${8:-1500}"
  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"

  # Launch replies and forwards in parallel with separate sessions
  local pid_r pid_f
  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!

  # Wait for completion and then analyze with emoji handling
  wait $pid_r
  wait $pid_f
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}

# End-to-end, non-interactive pipeline (from scratch): scrape -> replies -> fixtures -> analyze
# Requirements:
# - .env has TELEGRAM_API_ID and TELEGRAM_API_HASH (and TELEGRAM_2FA_PASSWORD if 2FA is enabled)
# - CH/POSTS_CSV/REPLIES_CSV/FIXTURES_CSV/TAGS_CFG are set (defaults are defined above)
# - Provide optional start/end dates; defaults use FIXTURES_START_DATE/FIXTURES_END_DATE
# - Choose sentiment backend via arg 11: vader | transformers | gpt (default: transformers)
run_all() {
  local ch="${1:-$CH}"
  local start="${2:-$FIXTURES_START_DATE}"
  local end="${3:-$FIXTURES_END_DATE}"
  local posts="${4:-$POSTS_CSV}"
  local replies="${5:-$REPLIES_CSV}"
  local fixtures="${6:-$FIXTURES_CSV}"
  local tags="${7:-$TAGS_CFG}"
  local sess_scrape="${8:-$SESSION_DIR/telegram_scrape}"
  local sess_replies="${9:-$SESSION_DIR/telegram_replies}"
  local rep_conc="${10:-15}"
  local backend="${11:-transformers}"  # vader | transformers | gpt
  local model="${12:-models/sentiment-distilbert}"
  local gpt_model="${13:-llama3}"
  local gpt_url="${14:-http://localhost:11434}"

  # 1) Scrape posts (overwrite)
  ./run_scraper.sh scrape \
    -c "$ch" \
    -o "$posts" \
    --start-date "$start" \
    --end-date "$end" \
    --session-name "$sess_scrape"

  # 2) Fetch replies (resume+append safe)
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$replies" \
    --min-replies 1 \
    --concurrency "$rep_conc" \
    --resume \
    --append \
    --session-name "$sess_replies"

  # 3) Fetch fixtures for the same period
  ./run_scraper.sh fixtures \
    --start-date "$start" \
    --end-date "$end" \
    -o "$fixtures"

  # 4) Analyze with plots (non-interactive)
  local args=(
    -i "$posts"
    --tags-config "$tags"
    --replies-csv "$replies"
    --fixtures-csv "$fixtures"
    --write-augmented-csv
    --write-combined-csv
    --emoji-mode keep
    --emoji-boost
    --save-plots
    --plot-width-scale 0.8
    --plot-max-width 120
    --plot-height 8
    --activity-top-n 8
    --labels-stagger-rows 3
  )
  if [[ "$backend" == "transformers" ]]; then
    args+=( --sentiment-backend transformers --transformers-model "$model" --export-transformers-details )
  elif [[ "$backend" == "gpt" ]]; then
    args+=( --sentiment-backend gpt --gpt-model "$gpt_model" --gpt-base-url "$gpt_url" )
  else
    args+=( --sentiment-backend vader )
  fi

  ./run_scraper.sh analyze "${args[@]}"
}
1
src/__init__.py
Normal file
@@ -0,0 +1 @@
# This file is intentionally left blank.
1313
src/analyze_csv.py
Normal file
File diff suppressed because it is too large
50
src/apply_labels.py
Normal file
@@ -0,0 +1,50 @@
import argparse
import os
import pandas as pd


def read_csv(path: str) -> pd.DataFrame:
    if not os.path.exists(path):
        raise SystemExit(f"CSV not found: {path}")
    return pd.read_csv(path)


def main():
    p = argparse.ArgumentParser(description='Apply labeled sentiments to posts/replies CSVs for analysis plots.')
    p.add_argument('--labeled-csv', required=True, help='Path to labeled_sentiment.csv (must include id and label columns)')
    p.add_argument('--posts-csv', required=True, help='Original posts CSV')
    p.add_argument('--replies-csv', required=True, help='Original replies CSV')
    p.add_argument('--posts-out', default=None, help='Output posts CSV path (default: <posts> with _with_labels suffix)')
    p.add_argument('--replies-out', default=None, help='Output replies CSV path (default: <replies> with _with_labels suffix)')
    args = p.parse_args()

    labeled = read_csv(args.labeled_csv)
    if 'id' not in labeled.columns:
        raise SystemExit('labeled CSV must include an id column to merge on')
    # normalize label column name to sentiment_label
    lab_col = 'label' if 'label' in labeled.columns else ('sentiment_label' if 'sentiment_label' in labeled.columns else None)
    if lab_col is None:
        raise SystemExit("labeled CSV must include a 'label' or 'sentiment_label' column")
    labeled = labeled[['id', lab_col] + (['confidence'] if 'confidence' in labeled.columns else [])].copy()
    labeled = labeled.rename(columns={lab_col: 'sentiment_label'})

    posts = read_csv(args.posts_csv)
    replies = read_csv(args.replies_csv)

    if 'id' not in posts.columns or 'id' not in replies.columns:
        raise SystemExit('posts/replies CSVs must include id columns')

    posts_out = args.posts_out or os.path.splitext(args.posts_csv)[0] + '_with_labels.csv'
    replies_out = args.replies_out or os.path.splitext(args.replies_csv)[0] + '_with_labels.csv'

    posts_merged = posts.merge(labeled, how='left', on='id', validate='m:1')
    replies_merged = replies.merge(labeled, how='left', on='id', validate='m:1')

    posts_merged.to_csv(posts_out, index=False)
    replies_merged.to_csv(replies_out, index=False)
    print(f"Wrote posts with labels -> {posts_out} (rows={len(posts_merged)})")
    print(f"Wrote replies with labels -> {replies_out} (rows={len(replies_merged)})")


if __name__ == '__main__':
    main()
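The merge in `apply_labels.py` is a left join on `id`: every post or reply keeps its row, and `sentiment_label` is attached only when the id appears in the labeled set (`validate='m:1'` guards against duplicate labeled ids). A dependency-free miniature of those semantics, with made-up ids and labels:

```python
# Toy labeled set: id -> sentiment_label (values are illustrative).
labeled = {101: "pos", 103: "neg"}

posts = [
    {"id": 101, "message": "Great win today"},
    {"id": 102, "message": "Kickoff at 3pm"},
]

# Left join on 'id': unmatched posts survive with a missing label
# (None here; pandas would use NaN).
merged = [{**row, "sentiment_label": labeled.get(row["id"])} for row in posts]

for row in merged:
    print(row["id"], row["sentiment_label"])
```

This is why downstream plots can distinguish "labeled" from "unlabeled" rows: the unlabeled ones carry a missing value rather than being dropped.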
107
src/audit_team_sentiment.py
Normal file
@@ -0,0 +1,107 @@
import argparse
import os
from typing import List

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


def parse_tags_column(series: pd.Series) -> pd.Series:
    def _to_list(x):
        if isinstance(x, list):
            return x
        if pd.isna(x):
            return []
        s = str(x)
        # Expect semicolon-delimited from augmented CSV, but also accept comma
        if ';' in s:
            return [t.strip() for t in s.split(';') if t.strip()]
        if ',' in s:
            return [t.strip() for t in s.split(',') if t.strip()]
        return [s] if s else []
    return series.apply(_to_list)


def main():
    parser = argparse.ArgumentParser(description='Audit sentiment per team tag and export samples for inspection.')
    parser.add_argument('--csv', default='data/premier_league_update_tagged.csv', help='Tagged posts CSV (augmented by analyze)')
    parser.add_argument('--team', default='club_manchester_united', help='Team tag to export samples for (e.g., club_manchester_united)')
    parser.add_argument('--out-dir', default='data', help='Directory to write audit outputs')
    parser.add_argument('--samples', type=int, default=25, help='Number of samples to export for the specified team')
    parser.add_argument('--with-vader', action='store_true', help='Also compute VADER-based sentiment shares as a sanity check')
    args = parser.parse_args()

    if not os.path.exists(args.csv):
        raise SystemExit(f"CSV not found: {args.csv}. Run analyze with --write-augmented-csv first.")

    df = pd.read_csv(args.csv)
    if 'message' not in df.columns:
        raise SystemExit('CSV missing message column')
    if 'sentiment_compound' not in df.columns:
        raise SystemExit('CSV missing sentiment_compound column')
    if 'tags' not in df.columns:
        raise SystemExit('CSV missing tags column')

    df = df.copy()
    df['tags'] = parse_tags_column(df['tags'])
    # Filter to team tags (prefix club_)
    e = df.explode('tags')
    e = e[e['tags'].notna() & (e['tags'] != '')]
    e = e[e['tags'].astype(str).str.startswith('club_')]
    e = e.dropna(subset=['sentiment_compound'])
    if e.empty:
        print('No team-tagged rows found.')
        return

    # Shares
    e = e.copy()
    e['is_pos'] = e['sentiment_compound'] > 0.05
    e['is_neg'] = e['sentiment_compound'] < -0.05
    grp = (
        e.groupby('tags')
        .agg(
            n=('sentiment_compound', 'count'),
            mean=('sentiment_compound', 'mean'),
            median=('sentiment_compound', 'median'),
            pos_share=('is_pos', 'mean'),
            neg_share=('is_neg', 'mean'),
        )
        .reset_index()
    )
    grp['neu_share'] = (1 - grp['pos_share'] - grp['neg_share']).clip(lower=0)
    grp = grp.sort_values(['n', 'mean'], ascending=[False, False])

    if args.with_vader:
        # Compute VADER shares on the underlying messages per team
        analyzer = SentimentIntensityAnalyzer()

        def _vader_sentiment_share(sub: pd.DataFrame):
            if sub.empty:
                return pd.Series({'pos_share_vader': 0.0, 'neg_share_vader': 0.0, 'neu_share_vader': 0.0})
            scores = sub['message'].astype(str).apply(lambda t: analyzer.polarity_scores(t or '')['compound'])
            pos = (scores > 0.05).mean()
            neg = (scores < -0.05).mean()
            neu = max(0.0, 1.0 - pos - neg)
            return pd.Series({'pos_share_vader': pos, 'neg_share_vader': neg, 'neu_share_vader': neu})

        vader_grp = e.groupby('tags').apply(_vader_sentiment_share).reset_index()
        grp = grp.merge(vader_grp, on='tags', how='left')

    os.makedirs(args.out_dir, exist_ok=True)
    out_summary = os.path.join(args.out_dir, 'team_sentiment_audit.csv')
    grp.to_csv(out_summary, index=False)
    print(f"Wrote summary: {out_summary}")

    # Export samples for selected team
    te = e[e['tags'] == args.team].copy()
    if te.empty:
        print(f"No rows for team tag: {args.team}")
        return
    # Sort by sentiment descending to inspect highly positive claims
    te = te.sort_values('sentiment_compound', ascending=False)
    cols = [c for c in ['id', 'date', 'message', 'sentiment_compound', 'url'] if c in te.columns]
    samples_path = os.path.join(args.out_dir, f"{args.team}_samples.csv")
    te[cols].head(args.samples).to_csv(samples_path, index=False)
    print(f"Wrote samples: {samples_path} ({min(args.samples, len(te))} rows)")


if __name__ == '__main__':
    main()
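`parse_tags_column` above tolerates both delimiters the pipeline may produce: semicolon-separated tags from the augmented CSV, with comma as a fallback. A standalone sketch of the same per-cell logic without pandas (the sample tag strings are illustrative only):

```python
def parse_tags(value):
    """Split one delimited tag cell into a clean list; prefer ';' over ','."""
    if value is None:
        return []
    s = str(value).strip()
    if not s:
        return []
    if ";" in s:
        return [t.strip() for t in s.split(";") if t.strip()]
    if "," in s:
        return [t.strip() for t in s.split(",") if t.strip()]
    return [s]  # single tag, no delimiter

print(parse_tags("club_arsenal; club_chelsea"))
print(parse_tags("club_arsenal,club_spurs"))
print(parse_tags(None))
```

Checking `';'` before `','` matters: a cell like `"a;b,c"` is treated as semicolon-delimited, matching the augmented CSV's convention.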
218
src/auto_label_sentiment.py
Normal file
@@ -0,0 +1,218 @@
import argparse
import os
from typing import List, Optional, Tuple

import numpy as np
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

try:
    # Allow both package and direct script execution
    from .make_labeling_set import load_messages as _load_messages
except Exception:
    from make_labeling_set import load_messages as _load_messages


def _combine_inputs(posts_csv: Optional[str], replies_csv: Optional[str], text_col: str = 'message', min_length: int = 3) -> pd.DataFrame:
    frames: List[pd.DataFrame] = []
    if posts_csv:
        frames.append(_load_messages(posts_csv, text_col=text_col))
    if replies_csv:
        # include parent_id if present for replies
        frames.append(_load_messages(replies_csv, text_col=text_col, extra_cols=['parent_id']))
    if not frames:
        raise SystemExit('No input provided. Use --input-csv or --posts-csv/--replies-csv')
    df = pd.concat(frames, ignore_index=True)
    df['message'] = df['message'].fillna('').astype(str)
    df = df[df['message'].str.len() >= min_length]
    df = df.drop_duplicates(subset=['message']).reset_index(drop=True)
    return df


def _map_label_str_to_int(labels: List[str]) -> List[int]:
    mapping = {'neg': 0, 'negative': 0, 'neu': 1, 'neutral': 1, 'pos': 2, 'positive': 2}
    out: List[int] = []
    for lab in labels:
        lab_l = (lab or '').lower()
        if lab_l in mapping:
            out.append(mapping[lab_l])
        else:
            # fallback: try to parse integer
            try:
                out.append(int(lab))
            except Exception:
                out.append(1)  # default to neutral
    return out


def _vader_label(compound: float, pos_th: float, neg_th: float) -> str:
    if compound >= pos_th:
        return 'pos'
    if compound <= neg_th:
        return 'neg'
    return 'neu'


def _auto_label_vader(texts: List[str], pos_th: float, neg_th: float, min_margin: float) -> Tuple[List[str], List[float]]:
    analyzer = SentimentIntensityAnalyzer()
    labels: List[str] = []
    confs: List[float] = []
    for t in texts:
        s = analyzer.polarity_scores(t or '')
        comp = float(s.get('compound', 0.0))
        lab = _vader_label(comp, pos_th, neg_th)
        # Confidence heuristic: distance from neutral band edges
        if lab == 'pos':
            conf = max(0.0, comp - pos_th)
        elif lab == 'neg':
            conf = max(0.0, abs(comp - neg_th))
        else:
            # closer to 0 is more neutral; confidence inversely related to |compound|
            conf = max(0.0, (pos_th - abs(comp)))
        labels.append(lab)
        confs.append(conf)
    # Normalize confidence roughly to [0,1] by clipping with a reasonable scale
    confs = [min(1.0, c / max(1e-6, min_margin)) for c in confs]
    return labels, confs


def _auto_label_transformers(texts: List[str], model_name_or_path: str, batch_size: int, min_prob: float, min_margin: float) -> Tuple[List[str], List[float]]:
    try:
        from .transformer_sentiment import TransformerSentiment
    except Exception:
        from transformer_sentiment import TransformerSentiment

    clf = TransformerSentiment(model_name_or_path)
    probs_all, labels_all = clf.predict_probs_and_labels(texts, batch_size=batch_size)
    confs: List[float] = []
    for row in probs_all:
        row = np.array(row, dtype=float)
        if row.size == 0:
            confs.append(0.0)
            continue
        top2 = np.sort(row)[-2:] if row.size >= 2 else np.array([0.0, row.max()])
        max_p = float(row.max())
        margin = float(top2[-1] - top2[-2]) if row.size >= 2 else max_p
        # Confidence must satisfy both conditions
        conf = min(max(0.0, (max_p - min_prob) / max(1e-6, 1 - min_prob)), max(0.0, margin / max(1e-6, min_margin)))
        confs.append(conf)
    # Map arbitrary id2label names to canonical 'neg/neu/pos' when obvious; else keep as-is
    canonical = []
    for lab in labels_all:
        ll = (lab or '').lower()
        if 'neg' in ll:
            canonical.append('neg')
        elif 'neu' in ll or 'neutral' in ll:
            canonical.append('neu')
        elif 'pos' in ll or 'positive' in ll:
            canonical.append('pos')
        else:
            canonical.append(lab)
    return canonical, confs


def main():
    parser = argparse.ArgumentParser(description='Automatically label sentiment without manual annotation.')
    src = parser.add_mutually_exclusive_group(required=True)
    src.add_argument('--input-csv', help='Single CSV containing a text column (default: message)')
    src.add_argument('--posts-csv', help='Posts CSV to include')
|
||||||
|
parser.add_argument('--replies-csv', help='Replies CSV to include (combined with posts if provided)')
|
||||||
|
parser.add_argument('--text-col', default='message', help='Text column name in input CSV(s)')
|
||||||
|
parser.add_argument('-o', '--output', default='data/labeled_sentiment.csv', help='Output labeled CSV path')
|
||||||
|
parser.add_argument('--limit', type=int, default=None, help='Optional cap on number of rows')
|
||||||
|
parser.add_argument('--min-length', type=int, default=3, help='Minimum text length to consider')
|
||||||
|
|
||||||
|
parser.add_argument('--backend', choices=['vader', 'transformers', 'gpt'], default='vader', help='Labeling backend: vader, transformers, or gpt (local via Ollama)')
|
||||||
|
# VADER knobs
|
||||||
|
parser.add_argument('--vader-pos', type=float, default=0.05, help='VADER positive threshold (compound >=)')
|
||||||
|
parser.add_argument('--vader-neg', type=float, default=-0.05, help='VADER negative threshold (compound <=)')
|
||||||
|
parser.add_argument('--vader-margin', type=float, default=0.2, help='Confidence scaling for VADER distance')
|
||||||
|
# Transformers knobs
|
||||||
|
parser.add_argument('--transformers-model', default='cardiffnlp/twitter-roberta-base-sentiment-latest', help='HF model for 3-class sentiment')
|
||||||
|
parser.add_argument('--batch-size', type=int, default=64)
|
||||||
|
parser.add_argument('--min-prob', type=float, default=0.6, help='Min top class probability to accept')
|
||||||
|
parser.add_argument('--min-margin', type=float, default=0.2, help='Min prob gap between top-1 and top-2 to accept')
|
||||||
|
|
||||||
|
# GPT knobs
|
||||||
|
parser.add_argument('--gpt-model', default='llama3', help='Local GPT model name (Ollama)')
|
||||||
|
parser.add_argument('--gpt-base-url', default='http://localhost:11434', help='Base URL for local GPT server (Ollama)')
|
||||||
|
parser.add_argument('--gpt-batch-size', type=int, default=8)
|
||||||
|
|
||||||
|
parser.add_argument('--label-format', choices=['str', 'int'], default='str', help="Output labels as strings ('neg/neu/pos') or integers (0/1/2)")
|
||||||
|
parser.add_argument('--only-confident', action='store_true', help='Drop rows that do not meet confidence thresholds')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Load inputs
|
||||||
|
if args.input_csv:
|
||||||
|
if not os.path.exists(args.input_csv):
|
||||||
|
raise SystemExit(f"Input CSV not found: {args.input_csv}")
|
||||||
|
df = pd.read_csv(args.input_csv)
|
||||||
|
if args.text_col not in df.columns:
|
||||||
|
raise SystemExit(f"Text column '{args.text_col}' not in {args.input_csv}")
|
||||||
|
df = df.copy()
|
||||||
|
df['message'] = df[args.text_col].astype(str)
|
||||||
|
base_cols = [c for c in ['id', 'date', 'message', 'url'] if c in df.columns]
|
||||||
|
df = df[base_cols if base_cols else ['message']]
|
||||||
|
df = df[df['message'].str.len() >= args.min_length]
|
||||||
|
df = df.drop_duplicates(subset=['message']).reset_index(drop=True)
|
||||||
|
else:
|
||||||
|
df = _combine_inputs(args.posts_csv, args.replies_csv, text_col=args.text_col, min_length=args.min_length)
|
||||||
|
|
||||||
|
if args.limit and len(df) > args.limit:
|
||||||
|
df = df.head(args.limit)
|
||||||
|
|
||||||
|
texts = df['message'].astype(str).tolist()
|
||||||
|
|
||||||
|
# Predict labels + confidence
|
||||||
|
if args.backend == 'vader':
|
||||||
|
labels, conf = _auto_label_vader(texts, pos_th=args.vader_pos, neg_th=args.vader_neg, min_margin=args.vader_margin)
|
||||||
|
# For VADER, define acceptance: confident if outside neutral band by at least margin, or inside band with closeness to 0 below threshold
|
||||||
|
accept = []
|
||||||
|
analyzer = SentimentIntensityAnalyzer()
|
||||||
|
for t in texts:
|
||||||
|
comp = analyzer.polarity_scores(t or '').get('compound')
|
||||||
|
if comp is None:
|
||||||
|
accept.append(False)
|
||||||
|
continue
|
||||||
|
comp = float(comp)
|
||||||
|
if comp >= args.vader_pos + args.vader_margin or comp <= args.vader_neg - args.vader_margin:
|
||||||
|
accept.append(True)
|
||||||
|
else:
|
||||||
|
# inside or near band -> consider less confident
|
||||||
|
accept.append(False)
|
||||||
|
elif args.backend == 'transformers':
|
||||||
|
labels, conf = _auto_label_transformers(texts, args.transformers_model, args.batch_size, args.min_prob, args.min_margin)
|
||||||
|
accept = [((c >= 1.0)) or ((c >= 0.5)) for c in conf] # normalize conf ~[0,1]; accept medium-high confidence
|
||||||
|
else:
|
||||||
|
# GPT backend via Ollama: expect label+confidence
|
||||||
|
try:
|
||||||
|
from .gpt_sentiment import GPTSentiment
|
||||||
|
except Exception:
|
||||||
|
from gpt_sentiment import GPTSentiment
|
||||||
|
clf = GPTSentiment(base_url=args.gpt_base_url, model=args.gpt_model)
|
||||||
|
labels, conf = clf.predict_label_conf_batch(texts, batch_size=args.gpt_batch_size)
|
||||||
|
# Accept medium-high confidence; simple threshold like transformers path
|
||||||
|
accept = [c >= 0.5 for c in conf]
|
||||||
|
|
||||||
|
out = df.copy()
|
||||||
|
out.insert(1, 'label', labels)
|
||||||
|
out['confidence'] = conf
|
||||||
|
|
||||||
|
if args.only_confident:
|
||||||
|
out = out[np.array(accept, dtype=bool)]
|
||||||
|
out = out.reset_index(drop=True)
|
||||||
|
|
||||||
|
if args.label_format == 'int':
|
||||||
|
out['label'] = _map_label_str_to_int(out['label'].astype(str).tolist())
|
||||||
|
|
||||||
|
os.makedirs(os.path.dirname(args.output) or '.', exist_ok=True)
|
||||||
|
out.to_csv(args.output, index=False)
|
||||||
|
kept = len(out)
|
||||||
|
print(f"Wrote {kept} labeled rows to {args.output} using backend={args.backend}")
|
||||||
|
if args.only_confident:
|
||||||
|
print("Note: only confident predictions were kept. You can remove --only-confident to include all rows.")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
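The neutral-band confidence heuristic in `_auto_label_vader` can be illustrated in isolation. A minimal sketch using the module's default thresholds; `band_label` and `band_confidence` are hypothetical helper names for illustration, not part of the repo:

```python
def band_label(compound: float, pos_th: float = 0.05, neg_th: float = -0.05) -> str:
    # Same banding as _vader_label: pos above the band, neg below, neu inside.
    if compound >= pos_th:
        return 'pos'
    if compound <= neg_th:
        return 'neg'
    return 'neu'


def band_confidence(compound: float, pos_th: float = 0.05, neg_th: float = -0.05,
                    min_margin: float = 0.2) -> float:
    # Raw confidence = distance from the nearest band edge (or closeness to 0
    # for neutral), then normalized to [0, 1] by the margin scale and clipped.
    lab = band_label(compound, pos_th, neg_th)
    if lab == 'pos':
        raw = max(0.0, compound - pos_th)
    elif lab == 'neg':
        raw = max(0.0, abs(compound - neg_th))
    else:
        raw = max(0.0, pos_th - abs(compound))
    return min(1.0, raw / max(1e-6, min_margin))


print(band_label(0.6), band_confidence(0.6))    # strongly positive -> clipped to 1.0
print(band_label(0.02), band_confidence(0.02))  # inside the neutral band -> low confidence
```

A compound of 0.6 sits 0.55 past the positive edge, far beyond the 0.2 margin, so its confidence clips to 1.0; a compound of 0.02 is neutral with only weak confidence.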
48
src/eval_sentiment.py
Normal file
@@ -0,0 +1,48 @@
import argparse
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

try:
    from .transformer_sentiment import TransformerSentiment
except ImportError:
    # Allow running as a script via -m src.eval_sentiment
    from transformer_sentiment import TransformerSentiment


def main():
    parser = argparse.ArgumentParser(description='Evaluate a fine-tuned transformers sentiment model on a labeled CSV')
    parser.add_argument('--csv', required=True, help='Labeled CSV path with message and label columns')
    parser.add_argument('--text-col', default='message')
    parser.add_argument('--label-col', default='label')
    parser.add_argument('--model', required=True, help='Model name or local path')
    parser.add_argument('--batch-size', type=int, default=64)
    args = parser.parse_args()

    df = pd.read_csv(args.csv)
    df = df[[args.text_col, args.label_col]].dropna().copy()
    texts = df[args.text_col].astype(str).tolist()
    true_labels = df[args.label_col].astype(str).tolist()

    clf = TransformerSentiment(args.model)
    _, pred_labels = clf.predict_probs_and_labels(texts, batch_size=args.batch_size)

    y_true = np.array(true_labels)
    y_pred = np.array(pred_labels)

    # If labels differ from the model's id2label names, normalize to strings for comparison
    acc = accuracy_score(y_true, y_pred)
    f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
    prec_macro = precision_score(y_true, y_pred, average='macro', zero_division=0)
    rec_macro = recall_score(y_true, y_pred, average='macro', zero_division=0)

    print('Accuracy:', f"{acc:.4f}")
    print('F1 (macro):', f"{f1_macro:.4f}")
    print('Precision (macro):', f"{prec_macro:.4f}")
    print('Recall (macro):', f"{rec_macro:.4f}")

    print('\nClassification report:')
    print(classification_report(y_true, y_pred, zero_division=0))


if __name__ == '__main__':
    main()
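The script reports macro-averaged metrics, which give every class equal weight regardless of how many examples it has. A dependency-free sketch of what `f1_score(..., average='macro')` computes (the `macro_f1` helper here is illustrative, not part of the repo):

```python
def macro_f1(y_true, y_pred):
    # Per-class F1, then an unweighted mean over all classes seen in either list.
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)


print(macro_f1(['pos', 'neg', 'neu', 'pos'], ['pos', 'neu', 'neu', 'pos']))
```

Here 'pos' scores F1 = 1.0, 'neu' scores 2/3, and the never-predicted 'neg' scores 0, so the macro average is 5/9 even though 3 of 4 predictions are correct; that is why a class the model never predicts drags the macro score down hard.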
131
src/fetch_schedule.py
Normal file
@@ -0,0 +1,131 @@
import argparse
import csv
import os
from datetime import datetime
from typing import Any, Dict, List, Optional

import requests
from dotenv import load_dotenv

API_BASE = "https://api.football-data.org/v4"
COMPETITION_CODE = "PL"  # Premier League


def iso_date(d: str) -> str:
    # Accept YYYY-MM-DD and return an ISO date
    try:
        return datetime.fromisoformat(d).date().isoformat()
    except Exception as e:
        raise argparse.ArgumentTypeError(f"Invalid date: {d}. Use YYYY-MM-DD") from e


def fetch_matches(start_date: str, end_date: str, token: str) -> Dict[str, Any]:
    url = f"{API_BASE}/competitions/{COMPETITION_CODE}/matches"
    headers = {"X-Auth-Token": token}
    params = {
        "dateFrom": start_date,
        "dateTo": end_date,
    }
    r = requests.get(url, headers=headers, params=params, timeout=30)
    r.raise_for_status()
    return r.json()


def normalize_match(m: Dict[str, Any]) -> Dict[str, Any]:
    utc_date = m.get("utcDate")
    # Convert to date/time strings
    kick_iso = None
    if utc_date:
        try:
            kick_iso = datetime.fromisoformat(utc_date.replace("Z", "+00:00")).isoformat()
        except Exception:
            kick_iso = utc_date
    score = m.get("score", {})
    full_time = score.get("fullTime", {})

    return {
        "id": m.get("id"),
        "status": m.get("status"),
        "matchday": m.get("matchday"),
        "utcDate": kick_iso,
        "homeTeam": (m.get("homeTeam") or {}).get("name"),
        "awayTeam": (m.get("awayTeam") or {}).get("name"),
        "homeScore": full_time.get("home"),
        "awayScore": full_time.get("away"),
        "referees": ", ".join([r.get("name", "") for r in m.get("referees", []) if r.get("name")]),
        "venue": m.get("area", {}).get("name"),
        "competition": (m.get("competition") or {}).get("name"),
        "stage": m.get("stage"),
        "group": m.get("group"),
        "link": f"https://www.football-data.org/match/{m['id']}" if m.get("id") else None,
    }


def save_csv(matches: List[Dict[str, Any]], out_path: str) -> None:
    if not matches:
        # Write the header only
        fields = [
            "id",
            "status",
            "matchday",
            "utcDate",
            "homeTeam",
            "awayTeam",
            "homeScore",
            "awayScore",
            "referees",
            "venue",
            "competition",
            "stage",
            "group",
            "link",
        ]
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
        return
    fields = list(matches[0].keys())
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(matches)


def save_json(matches: List[Dict[str, Any]], out_path: str) -> None:
    import json

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(matches, f, ensure_ascii=False, indent=2)


def main():
    parser = argparse.ArgumentParser(description="Fetch Premier League fixtures in a date range and save to CSV/JSON")
    parser.add_argument("--start-date", required=True, type=iso_date, help="YYYY-MM-DD (inclusive)")
    parser.add_argument("--end-date", required=True, type=iso_date, help="YYYY-MM-DD (inclusive)")
    parser.add_argument("-o", "--output", required=True, help="Output file path (.csv or .json)")
    args = parser.parse_args()

    load_dotenv()
    token = os.getenv("FOOTBALL_DATA_API_TOKEN")
    if not token:
        raise SystemExit("Missing FOOTBALL_DATA_API_TOKEN in environment (.env)")

    data = fetch_matches(args.start_date, args.end_date, token)
    matches_raw = data.get("matches", [])
    matches = [normalize_match(m) for m in matches_raw]

    os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)

    ext = os.path.splitext(args.output)[1].lower()
    if ext == ".csv":
        save_csv(matches, args.output)
    elif ext == ".json":
        save_json(matches, args.output)
    else:
        raise SystemExit("Output must end with .csv or .json")

    print(f"Saved {len(matches)} matches to {args.output}")


if __name__ == "__main__":
    main()
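The `utcDate` handling in `normalize_match` exists because `datetime.fromisoformat` (before Python 3.11) rejects the trailing `Z` suffix; swapping it for an explicit `+00:00` offset makes the timestamp parseable. A standalone sketch of that step (`parse_kickoff` is an illustrative name):

```python
from datetime import datetime


def parse_kickoff(utc_date: str) -> str:
    # Replace the trailing "Z" with an explicit UTC offset so fromisoformat
    # accepts it on Python 3.9/3.10; fall back to the raw string on failure.
    try:
        return datetime.fromisoformat(utc_date.replace("Z", "+00:00")).isoformat()
    except ValueError:
        return utc_date


print(parse_kickoff("2024-08-17T14:00:00Z"))  # -> 2024-08-17T14:00:00+00:00
```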
93
src/gpt_sentiment.py
Normal file
@@ -0,0 +1,93 @@
import json
from typing import List, Tuple

import requests


class GPTSentiment:
    """
    Minimal client for a local GPT model served by Ollama.

    Expects the model to respond with a strict JSON object like:
    {"label": "neg|neu|pos", "confidence": 0.0..1.0}

    Endpoint used: POST {base_url}/api/generate with payload:
    {"model": <model>, "prompt": <prompt>, "stream": false, "format": "json"}
    """

    def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3", timeout: int = 30):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.timeout = timeout

    def _build_prompt(self, text: str) -> str:
        # Keep the instruction terse and deterministic; request strict JSON.
        return (
            "You are a strict JSON generator for sentiment analysis. "
            "Classify the INPUT text as one of: neg, neu, pos. "
            "Return ONLY a JSON object with keys 'label' and 'confidence' (0..1). "
            "No markdown, no prose.\n\n"
            f"INPUT: {text}"
        )

    def _call(self, prompt: str) -> dict:
        url = f"{self.base_url}/api/generate"
        payload = {
            "model": self.model,
            "prompt": prompt,
            "stream": False,
            "format": "json",
        }
        r = requests.post(url, json=payload, timeout=self.timeout)
        r.raise_for_status()
        data = r.json()
        # Ollama returns the model's text under 'response'
        raw = data.get("response", "").strip()
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            # Try to recover simple cases by stripping code fences
            raw2 = raw.strip().removeprefix("```").removesuffix("```")
            obj = json.loads(raw2)
        return obj

    @staticmethod
    def _canonical_label(s: str) -> str:
        s = (s or "").strip().lower()
        if "neg" in s:
            return "neg"
        if "neu" in s or "neutral" in s:
            return "neu"
        if "pos" in s or "positive" in s:
            return "pos"
        return s or "neu"

    @staticmethod
    def _compound_from_label_conf(label: str, confidence: float) -> float:
        label = GPTSentiment._canonical_label(label)
        c = max(0.0, min(1.0, float(confidence or 0.0)))
        if label == "pos":
            return c
        if label == "neg":
            return -c
        return 0.0

    def predict_label_conf_batch(self, texts: List[str], batch_size: int = 8) -> Tuple[List[str], List[float]]:
        labels: List[str] = []
        confs: List[float] = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            for t in batch:
                try:
                    obj = self._call(self._build_prompt(t))
                    lab = self._canonical_label(obj.get("label", ""))
                    conf = float(obj.get("confidence", 0.0))
                except Exception:
                    lab, conf = "neu", 0.0
                labels.append(lab)
                confs.append(conf)
        return labels, confs

    def predict_compound_batch(self, texts: List[str], batch_size: int = 8) -> List[float]:
        labels, confs = self.predict_label_conf_batch(texts, batch_size=batch_size)
        return [self._compound_from_label_conf(lab, conf) for lab, conf in zip(labels, confs)]
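The fence-stripping fallback in `_call` handles models that wrap their JSON in a Markdown code fence despite the `format: "json"` request. A standalone sketch of that recovery path (`parse_strict_json` is an illustrative name; `str.removeprefix`/`removesuffix` require Python 3.9+):

```python
import json


def parse_strict_json(raw: str) -> dict:
    # Try the happy path first; on failure, strip a surrounding Markdown fence
    # (with or without a "json" language tag) and parse again.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        cleaned = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
        return json.loads(cleaned)


print(parse_strict_json('```{"label": "pos", "confidence": 0.9}```'))
```

If the fenced content still fails to parse, the second `json.loads` raises, which the caller's `except Exception` turns into the neutral fallback label.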
65
src/make_labeling_set.py
Normal file
@@ -0,0 +1,65 @@
import argparse
import os
import pandas as pd


def load_messages(csv_path: str, text_col: str = 'message', extra_cols=None) -> pd.DataFrame:
    if not os.path.exists(csv_path):
        return pd.DataFrame()
    df = pd.read_csv(csv_path)
    if text_col not in df.columns:
        return pd.DataFrame()
    cols = ['id', text_col, 'date']
    if extra_cols:
        for c in extra_cols:
            if c in df.columns:
                cols.append(c)
    cols = [c for c in cols if c in df.columns]
    out = df[cols].copy()
    out.rename(columns={text_col: 'message'}, inplace=True)
    return out


def main():
    parser = argparse.ArgumentParser(description='Create a labeling CSV from posts and/or replies.')
    parser.add_argument('--posts-csv', required=False, help='Posts CSV path (e.g., data/..._update.csv)')
    parser.add_argument('--replies-csv', required=False, help='Replies CSV path')
    parser.add_argument('-o', '--output', default='data/labeled_sentiment.csv', help='Output CSV for labeling')
    parser.add_argument('--sample-size', type=int, default=1000, help='Total rows to include (after combining)')
    parser.add_argument('--min-length', type=int, default=3, help='Minimum message length to include')
    parser.add_argument('--shuffle', action='store_true', help='Shuffle before sampling (default: true)')
    parser.add_argument('--no-shuffle', dest='shuffle', action='store_false')
    parser.set_defaults(shuffle=True)
    args = parser.parse_args()

    frames = []
    if args.posts_csv:
        frames.append(load_messages(args.posts_csv))
    if args.replies_csv:
        # For replies, include parent_id if present
        r = load_messages(args.replies_csv, extra_cols=['parent_id'])
        frames.append(r)
    if not frames:
        raise SystemExit('No input CSVs provided. Use --posts-csv and/or --replies-csv.')

    df = pd.concat(frames, ignore_index=True)
    # Basic filtering: non-empty text, minimum length, drop duplicates by message text
    df['message'] = df['message'].fillna('').astype(str)
    df = df[df['message'].str.len() >= args.min_length]
    df = df.drop_duplicates(subset=['message']).reset_index(drop=True)

    if args.shuffle:
        df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
    if args.sample_size and len(df) > args.sample_size:
        df = df.head(args.sample_size)

    # Add a blank label column for human annotation
    df.insert(1, 'label', '')

    os.makedirs(os.path.dirname(args.output) or '.', exist_ok=True)
    df.to_csv(args.output, index=False)
    print(f"Wrote labeling CSV with {len(df)} rows to {args.output}")


if __name__ == '__main__':
    main()
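The `sample(frac=1.0, random_state=42)` call makes the shuffle reproducible, so re-running the script over the same inputs produces the same labeling set. The same idea with only the standard library (`sample_rows` is a hypothetical helper, not part of the repo):

```python
import random


def sample_rows(rows, k, seed=42):
    # Fixed-seed shuffle followed by a head cap: deterministic across runs,
    # mirroring df.sample(frac=1.0, random_state=42) + df.head(k).
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:k]


first = sample_rows(range(1000), 5)
second = sample_rows(range(1000), 5)
assert first == second  # same seed -> same sample
```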
137
src/plot_labeled.py
Normal file
@@ -0,0 +1,137 @@
import argparse
import os

import pandas as pd


def safe_read(path: str) -> pd.DataFrame:
    if not os.path.exists(path):
        raise SystemExit(f"Input labeled CSV not found: {path}")
    df = pd.read_csv(path)
    if 'label' not in df.columns:
        raise SystemExit("Expected a 'label' column in the labeled CSV")
    if 'message' in df.columns:
        df['message'] = df['message'].fillna('').astype(str)
    if 'confidence' in df.columns:
        df['confidence'] = pd.to_numeric(df['confidence'], errors='coerce')
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'], errors='coerce')
    return df


def ensure_out_dir(out_dir: str) -> str:
    os.makedirs(out_dir, exist_ok=True)
    return out_dir


def plot_all(df: pd.DataFrame, out_dir: str) -> None:
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_style('whitegrid')

    out_dir = ensure_out_dir(out_dir)

    # 1) Class distribution
    try:
        plt.figure(figsize=(6, 4))
        (df['label'].astype(str).str.lower().value_counts()
         .reindex(['neg', 'neu', 'pos'])
         .fillna(0)
         .plot(kind='bar', legend=False, color=['#d62728', '#aaaaaa', '#2ca02c'], ax=plt.gca()))
        plt.title('Labeled class distribution')
        plt.ylabel('Count')
        plt.tight_layout()
        path = os.path.join(out_dir, 'labeled_class_distribution.png')
        plt.savefig(path, dpi=150)
        plt.close()
        print(f"[plots] Saved {path}")
    except Exception as e:
        print(f"[plots] Skipped class distribution: {e}")

    # 2) Confidence histogram (overall)
    if 'confidence' in df.columns and df['confidence'].notna().any():
        try:
            plt.figure(figsize=(6, 4))
            sns.histplot(df['confidence'].dropna(), bins=30, color='#1f77b4')
            plt.title('Confidence distribution (overall)')
            plt.xlabel('Confidence'); plt.ylabel('Frequency')
            plt.tight_layout()
            path = os.path.join(out_dir, 'labeled_confidence_hist.png')
            plt.savefig(path, dpi=150); plt.close()
            print(f"[plots] Saved {path}")
        except Exception as e:
            print(f"[plots] Skipped confidence histogram: {e}")

    # 3) Confidence by label (boxplot)
    try:
        plt.figure(figsize=(6, 4))
        t = df[['label', 'confidence']].dropna()
        t['label'] = t['label'].astype(str).str.lower()
        order = ['neg', 'neu', 'pos']
        sns.boxplot(data=t, x='label', y='confidence', order=order, palette=['#d62728', '#aaaaaa', '#2ca02c'])
        plt.title('Confidence by label')
        plt.xlabel('Label'); plt.ylabel('Confidence')
        plt.tight_layout()
        path = os.path.join(out_dir, 'labeled_confidence_by_label.png')
        plt.savefig(path, dpi=150); plt.close()
        print(f"[plots] Saved {path}")
    except Exception as e:
        print(f"[plots] Skipped confidence by label: {e}")

    # 4) Message length by label
    if 'message' in df.columns:
        try:
            t = df[['label', 'message']].copy()
            t['label'] = t['label'].astype(str).str.lower()
            t['len'] = t['message'].astype(str).str.len()
            plt.figure(figsize=(6, 4))
            sns.boxplot(data=t, x='label', y='len', order=['neg', 'neu', 'pos'], palette=['#d62728', '#aaaaaa', '#2ca02c'])
            plt.title('Message length by label')
            plt.xlabel('Label'); plt.ylabel('Length (chars)')
            plt.tight_layout()
            path = os.path.join(out_dir, 'labeled_length_by_label.png')
            plt.savefig(path, dpi=150); plt.close()
            print(f"[plots] Saved {path}")
        except Exception as e:
            print(f"[plots] Skipped length by label: {e}")

    # 5) Daily counts per label (if date present)
    if 'date' in df.columns and df['date'].notna().any():
        try:
            t = df[['date', 'label']].dropna().copy()
            t['day'] = pd.to_datetime(t['date'], errors='coerce').dt.date
            t['label'] = t['label'].astype(str).str.lower()
            pv = t.pivot_table(index='day', columns='label', values='date', aggfunc='count').fillna(0)
            # Ensure a consistent column order
            for c in ['neg', 'neu', 'pos']:
                if c not in pv.columns:
                    pv[c] = 0
            pv = pv[['neg', 'neu', 'pos']]
            plt.figure(figsize=(10, 4))
            pv.plot(kind='bar', stacked=True, color=['#d62728', '#aaaaaa', '#2ca02c'], ax=plt.gca())
            plt.title('Daily labeled counts (stacked)')
            plt.xlabel('Day'); plt.ylabel('Count')
            plt.tight_layout()
            path = os.path.join(out_dir, 'labeled_daily_counts.png')
            plt.savefig(path, dpi=150); plt.close()
            print(f"[plots] Saved {path}")
        except Exception as e:
            print(f"[plots] Skipped daily counts: {e}")


def main():
    parser = argparse.ArgumentParser(description='Plot graphs from labeled sentiment data.')
    parser.add_argument('-i', '--input', default='data/labeled_sentiment.csv', help='Path to labeled CSV')
    parser.add_argument('-o', '--out-dir', default='data', help='Output directory for plots')
    args = parser.parse_args()

    df = safe_read(args.input)
    plot_all(df, args.out_dir)


if __name__ == '__main__':
    main()
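The `pivot_table` in the daily-counts plot is just a count of `(day, label)` pairs laid out as one row per day with one column per label. A dependency-free sketch of the same reshaping:

```python
from collections import Counter

# Stub rows standing in for the (day, label) pairs derived from the CSV.
rows = [("2024-08-17", "pos"), ("2024-08-17", "neg"), ("2024-08-18", "pos")]

counts = Counter(rows)  # (day, label) -> count
labels = ("neg", "neu", "pos")  # fixed column order, missing labels count as 0
table = {day: [counts[(day, lab)] for lab in labels]
         for day in sorted({d for d, _ in rows})}
print(table)
```

Each day maps to its `[neg, neu, pos]` counts, which is exactly the shape the stacked bar chart consumes.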
749
src/telegram_scraper.py
Normal file
@@ -0,0 +1,749 @@
|
|||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
from dataclasses import asdict, dataclass
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import AsyncIterator, Iterable, Optional, Sequence, Set, List, Tuple
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from telethon import TelegramClient
|
||||||
|
from telethon.errors import SessionPasswordNeededError
|
||||||
|
from telethon.errors.rpcerrorlist import MsgIdInvalidError, FloodWaitError
|
||||||
|
from telethon.tl.functions.messages import GetDiscussionMessageRequest
|
||||||
|
from telethon.tl.custom.message import Message
|
||||||
|
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapedMessage:
|
||||||
|
id: int
|
||||||
|
date: Optional[str] # ISO format
|
||||||
|
message: Optional[str]
|
||||||
|
sender_id: Optional[int]
|
||||||
|
views: Optional[int]
|
||||||
|
forwards: Optional[int]
|
||||||
|
replies: Optional[int]
|
||||||
|
url: Optional[str]
|
||||||
|
|
||||||
|
|
||||||
|
def to_iso(dt: datetime) -> str:
|
||||||
|
return dt.replace(tzinfo=None).isoformat()
|
||||||
|
|
||||||
|
|
||||||
|
async def iter_messages(
|
||||||
|
client: TelegramClient,
|
||||||
|
entity: str,
|
||||||
|
limit: Optional[int] = None,
|
||||||
|
offset_date: Optional[datetime] = None,
|
||||||
|
) -> AsyncIterator[Message]:
|
||||||
|
async for msg in client.iter_messages(entity, limit=limit, offset_date=offset_date):
|
||||||
|
yield msg
|
||||||
|
|
||||||
|
|
||||||
|
def message_to_record(msg: Message, channel_username: str) -> ScrapedMessage:
|
||||||
|
return ScrapedMessage(
|
||||||
|
id=msg.id,
|
||||||
|
date=to_iso(msg.date) if msg.date else None,
|
||||||
|
message=msg.message,
|
||||||
|
sender_id=getattr(msg.sender_id, 'value', msg.sender_id) if hasattr(msg, 'sender_id') else None,
|
||||||
|
views=getattr(msg, 'views', None),
|
||||||
|
forwards=getattr(msg, 'forwards', None),
|
||||||
|
replies=(msg.replies.replies if getattr(msg, 'replies', None) else None),
|
||||||
|
url=f"https://t.me/{channel_username}/{msg.id}" if channel_username else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
async def ensure_login(client: TelegramClient, phone: Optional[str] = None, twofa_password: Optional[str] = None):
|
||||||
|
# Connect and log in, prompting interactively if needed
|
||||||
|
await client.connect()
|
||||||
|
if not await client.is_user_authorized():
|
||||||
|
if not phone:
|
||||||
|
phone = input("Enter your phone number (with country code): ")
|
||||||
|
await client.send_code_request(phone)
|
||||||
|
code = input("Enter the login code you received: ")
|
||||||
|
try:
|
||||||
|
await client.sign_in(phone=phone, code=code)
|
||||||
|
except SessionPasswordNeededError:
|
||||||
|
if twofa_password is None:
|
||||||
|
twofa_password = input("Two-step verification enabled. Enter your password: ")
|
||||||
|
await client.sign_in(password=twofa_password)
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_channel(
    channel: str,
    output: str,
    limit: Optional[int] = None,
    offset_date: Optional[str] = None,  # deprecated in favor of start_date
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    append: bool = False,
    session_name: str = "telegram",
    phone: Optional[str] = None,
    twofa_password: Optional[str] = None,
):
    load_dotenv()
    api_id = os.getenv("TELEGRAM_API_ID")
    api_hash = os.getenv("TELEGRAM_API_HASH")
    session_name = os.getenv("TELEGRAM_SESSION_NAME", session_name)

    if not api_id or not api_hash:
        raise RuntimeError("Missing TELEGRAM_API_ID/TELEGRAM_API_HASH in environment. See .env.example")

    # Some providers store api_id as a string; Telethon expects int
    try:
        api_id_int = int(api_id)
    except Exception as e:
        raise RuntimeError("TELEGRAM_API_ID must be an integer") from e

    client = TelegramClient(session_name, api_id_int, api_hash)

    # Parse date filters
    parsed_start = None
    parsed_end = None
    if start_date:
        parsed_start = datetime.fromisoformat(start_date)
    elif offset_date:  # backward compatibility
        parsed_start = datetime.fromisoformat(offset_date)
    if end_date:
        parsed_end = datetime.fromisoformat(end_date)

    await ensure_login(client, phone=phone, twofa_password=twofa_password)

    # Determine output format based on extension
    ext = os.path.splitext(output)[1].lower()
    is_jsonl = ext in (".jsonl", ".ndjson")
    is_csv = ext == ".csv"

    if not (is_jsonl or is_csv):
        raise ValueError("Output file must end with .jsonl or .csv")

    # Prepare output writers
    csv_file = None
    csv_writer = None
    jsonl_file = None
    if is_csv:
        import csv

        mode = "a" if append else "w"
        csv_file = open(output, mode, newline="", encoding="utf-8")
        csv_writer = csv.DictWriter(
            csv_file,
            fieldnames=[
                "id",
                "date",
                "message",
                "sender_id",
                "views",
                "forwards",
                "replies",
                "url",
            ],
        )
        # Write header only if not appending, or the file is empty
        need_header = True
        try:
            if append and os.path.exists(output) and os.path.getsize(output) > 0:
                need_header = False
        except Exception:
            pass
        if need_header:
            csv_writer.writeheader()
    elif is_jsonl:
        # Open once; append or overwrite
        mode = "a" if append else "w"
        jsonl_file = open(output, mode, encoding="utf-8")

    written = 0
    try:
        async for msg in iter_messages(client, channel, limit=None, offset_date=None):
            # Telethon returns tz-aware datetimes; normalize for comparison
            msg_dt = msg.date
            if msg_dt is not None:
                msg_dt = msg_dt.replace(tzinfo=None)

            # Date range filter: include if within [parsed_start, parsed_end] (inclusive)
            if parsed_start and msg_dt and msg_dt < parsed_start:
                # Since we're iterating newest-first, once older than start we can stop
                break
            if parsed_end and msg_dt and msg_dt > parsed_end:
                continue

            rec = message_to_record(msg, channel_username=channel.lstrip("@"))
            if is_jsonl and jsonl_file is not None:
                jsonl_file.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")
            else:
                csv_writer.writerow(asdict(rec))  # type: ignore
            written += 1
            if limit is not None and written >= limit:
                break
    finally:
        if csv_file:
            csv_file.close()
        if jsonl_file:
            jsonl_file.close()
        await client.disconnect()

    return written


async def fetch_replies(
    channel: str,
    parent_ids: Sequence[int],
    output_csv: str,
    append: bool = False,
    session_name: str = "telegram",
    phone: Optional[str] = None,
    twofa_password: Optional[str] = None,
    concurrency: int = 5,
    existing_pairs: Optional[Set[Tuple[int, int]]] = None,
):
    load_dotenv()
    api_id = os.getenv("TELEGRAM_API_ID")
    api_hash = os.getenv("TELEGRAM_API_HASH")
    session_name = os.getenv("TELEGRAM_SESSION_NAME", session_name)

    if not api_id or not api_hash:
        raise RuntimeError("Missing TELEGRAM_API_ID/TELEGRAM_API_HASH in environment. See .env.example")
    client = TelegramClient(session_name, int(api_id), api_hash)
    await ensure_login(client, phone=phone, twofa_password=twofa_password)

    import csv

    # Rate-limit counters
    flood_hits = 0
    flood_wait_seconds = 0

    analyzer = SentimentIntensityAnalyzer()
    os.makedirs(os.path.dirname(output_csv) or ".", exist_ok=True)
    mode = "a" if append else "w"
    with open(output_csv, mode, newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["parent_id", "id", "date", "message", "sender_id", "sentiment_compound", "url"],
        )
        # Write header only if not appending or the file is empty
        need_header = True
        try:
            if append and os.path.exists(output_csv) and os.path.getsize(output_csv) > 0:
                need_header = False
        except Exception:
            pass
        if need_header:
            writer.writeheader()

        write_lock = asyncio.Lock()
        sem = asyncio.Semaphore(max(1, int(concurrency)))

        async def handle_parent(pid: int) -> List[dict]:
            # The rate-limit counters are shared across workers; mutate the enclosing scope
            nonlocal flood_hits, flood_wait_seconds
            rows: List[dict] = []
            # First try replies within the same channel (works for groups/supergroups)
            attempts = 0
            while attempts < 3:
                try:
                    async for reply in client.iter_messages(channel, reply_to=pid):
                        dt = reply.date.replace(tzinfo=None) if reply.date else None
                        url = f"https://t.me/{channel.lstrip('@')}/{reply.id}" if reply.id else None
                        text = reply.message or ""
                        sent = analyzer.polarity_scores(text).get("compound")
                        rows.append(
                            {
                                "parent_id": pid,
                                "id": reply.id,
                                "date": to_iso(dt) if dt else None,
                                "message": text,
                                "sender_id": getattr(reply, "sender_id", None),
                                "sentiment_compound": sent,
                                "url": url,
                            }
                        )
                    break
                except FloodWaitError as e:
                    secs = int(getattr(e, 'seconds', 5))
                    flood_hits += 1
                    flood_wait_seconds += secs
                    print(f"[rate-limit] FloodWait while scanning replies in-channel for parent {pid}; waiting {secs}s", flush=True)
                    await asyncio.sleep(secs + 1)
                    attempts += 1
                    continue
                except MsgIdInvalidError:
                    # Likely a channel with a linked discussion group; fall back below
                    rows.clear()
                    break
                except Exception:
                    break

            if rows:
                return rows

            # Fallback: channels whose comments live in a linked discussion group
            try:
                res = await client(GetDiscussionMessageRequest(peer=channel, msg_id=pid))
            except Exception:
                # No discussion thread found, or not accessible
                return rows

            # Identify the discussion chat and the root message id in that chat
            disc_chat = None
            if getattr(res, "chats", None):
                # Prefer the first chat returned as the discussion chat
                disc_chat = res.chats[0]

            disc_root_id = None
            for m in getattr(res, "messages", []) or []:
                try:
                    peer_id = getattr(m, "peer_id", None)
                    if not peer_id or not disc_chat:
                        continue
                    ch_id = getattr(peer_id, "channel_id", None) or getattr(peer_id, "chat_id", None)
                    if ch_id == getattr(disc_chat, "id", None):
                        disc_root_id = m.id
                        break
                except Exception:
                    continue

            if not disc_chat or not disc_root_id:
                return rows

            group_username = getattr(disc_chat, "username", None)
            attempts = 0
            while attempts < 3:
                try:
                    async for reply in client.iter_messages(disc_chat, reply_to=disc_root_id):
                        dt = reply.date.replace(tzinfo=None) if reply.date else None
                        text = reply.message or ""
                        sent = analyzer.polarity_scores(text).get("compound")
                        # Construct a URL only if the discussion group has a public username
                        url = None
                        if group_username and reply.id:
                            url = f"https://t.me/{group_username}/{reply.id}"
                        rows.append(
                            {
                                "parent_id": pid,
                                "id": reply.id,
                                "date": to_iso(dt) if dt else None,
                                "message": text,
                                "sender_id": getattr(reply, "sender_id", None),
                                "sentiment_compound": sent,
                                "url": url,
                            }
                        )
                    break
                except FloodWaitError as e:
                    secs = int(getattr(e, 'seconds', 5))
                    flood_hits += 1
                    flood_wait_seconds += secs
                    print(f"[rate-limit] FloodWait while scanning discussion group for parent {pid}; waiting {secs}s", flush=True)
                    await asyncio.sleep(secs + 1)
                    attempts += 1
                    continue
                except Exception:
                    break
            return rows

        total_written = 0
        processed = 0
        total = len(parent_ids) if hasattr(parent_ids, '__len__') else None

        async def worker(pid: int):
            nonlocal total_written, processed
            async with sem:
                rows = await handle_parent(int(pid))
                async with write_lock:
                    if rows:
                        # Dedupe against existing pairs if provided (resume mode)
                        if existing_pairs is not None:
                            filtered: List[dict] = []
                            for r in rows:
                                try:
                                    key = (int(r.get("parent_id")), int(r.get("id")))
                                except Exception:
                                    continue
                                if key in existing_pairs:
                                    continue
                                existing_pairs.add(key)
                                filtered.append(r)
                            rows = filtered
                        if rows:
                            writer.writerows(rows)
                            total_written += len(rows)
                    processed += 1
                    if processed % 10 == 0 or (rows and len(rows) > 0):
                        if total is not None:
                            print(f"[replies] processed {processed}/{total} parents; last parent {pid} wrote {len(rows)} replies; total replies {total_written}", flush=True)
                        else:
                            print(f"[replies] processed {processed} parents; last parent {pid} wrote {len(rows)} replies; total replies {total_written}", flush=True)

        tasks = [asyncio.create_task(worker(pid)) for pid in parent_ids]
        await asyncio.gather(*tasks)

    await client.disconnect()
    if flood_hits:
        print(f"[rate-limit] Summary: {flood_hits} FloodWait events; total waited ~{flood_wait_seconds}s", flush=True)


async def fetch_forwards(
    channel: str,
    parent_ids: Set[int],
    output_csv: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    scan_limit: Optional[int] = None,
    append: bool = False,
    session_name: str = "telegram",
    phone: Optional[str] = None,
    twofa_password: Optional[str] = None,
    concurrency: int = 5,
    chunk_size: int = 1000,
):
    """Best-effort: find forwarded messages within the SAME channel that reference the given parent_ids.

    The Telegram API does not provide a global reverse lookup of forwards across all channels; we therefore
    scan this channel's history and collect messages whose fwd_from.channel_post matches a parent id.
    """
    load_dotenv()
    api_id = os.getenv("TELEGRAM_API_ID")
    api_hash = os.getenv("TELEGRAM_API_HASH")
    session_name = os.getenv("TELEGRAM_SESSION_NAME", session_name)
    if not api_id or not api_hash:
        raise RuntimeError("Missing TELEGRAM_API_ID/TELEGRAM_API_HASH in environment. See .env.example")
    client = TelegramClient(session_name, int(api_id), api_hash)
    await ensure_login(client, phone=phone, twofa_password=twofa_password)

    import csv

    # Rate-limit counters (shared by both scan strategies below)
    flood_hits = 0
    flood_wait_seconds = 0

    analyzer = SentimentIntensityAnalyzer()
    os.makedirs(os.path.dirname(output_csv) or ".", exist_ok=True)
    mode = "a" if append else "w"
    write_lock = asyncio.Lock()
    with open(output_csv, mode, newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["parent_id", "id", "date", "message", "sender_id", "sentiment_compound", "url"],
        )
        need_header = True
        try:
            if append and os.path.exists(output_csv) and os.path.getsize(output_csv) > 0:
                need_header = False
        except Exception:
            pass
        if need_header:
            writer.writeheader()

        parsed_start = datetime.fromisoformat(start_date) if start_date else None
        parsed_end = datetime.fromisoformat(end_date) if end_date else None

        # If no scan_limit is provided, fall back to a sequential scan to avoid unbounded concurrency
        if scan_limit is None:
            scanned = 0
            matched = 0
            async for msg in client.iter_messages(channel, limit=None):
                dt = msg.date.replace(tzinfo=None) if msg.date else None
                if parsed_start and dt and dt < parsed_start:
                    break
                if parsed_end and dt and dt > parsed_end:
                    continue
                fwd = getattr(msg, "fwd_from", None)
                if not fwd:
                    continue
                ch_post = getattr(fwd, "channel_post", None)
                if ch_post and int(ch_post) in parent_ids:
                    text = msg.message or ""
                    sent = analyzer.polarity_scores(text).get("compound")
                    url = f"https://t.me/{channel.lstrip('@')}/{msg.id}" if msg.id else None
                    writer.writerow(
                        {
                            "parent_id": int(ch_post),
                            "id": msg.id,
                            "date": to_iso(dt) if dt else None,
                            "message": text,
                            "sender_id": getattr(msg, "sender_id", None),
                            "sentiment_compound": sent,
                            "url": url,
                        }
                    )
                    matched += 1
                scanned += 1
                if scanned % 1000 == 0:
                    print(f"[forwards] scanned ~{scanned} messages; total forwards {matched}", flush=True)
        else:
            # Concurrent chunked scanning by id ranges
            sem = asyncio.Semaphore(max(1, int(concurrency)))
            progress_lock = asyncio.Lock()
            matched_total = 0
            completed_chunks = 0

            # Determine the latest message id
            latest_msg = await client.get_messages(channel, limit=1)
            latest_id = None
            try:
                latest_id = getattr(latest_msg, 'id', None) or (latest_msg[0].id if latest_msg else None)
            except Exception:
                latest_id = None
            if not latest_id:
                await client.disconnect()
                return

            total_chunks = max(1, (int(scan_limit) + int(chunk_size) - 1) // int(chunk_size))

            async def process_chunk(idx: int):
                nonlocal flood_hits, flood_wait_seconds
                nonlocal matched_total, completed_chunks
                max_id = latest_id - idx * int(chunk_size)
                min_id = max(0, max_id - int(chunk_size))
                attempts = 0
                local_matches = 0
                while attempts < 3:
                    try:
                        async with sem:
                            async for msg in client.iter_messages(channel, min_id=min_id, max_id=max_id):
                                dt = msg.date.replace(tzinfo=None) if msg.date else None
                                if parsed_start and dt and dt < parsed_start:
                                    # This range reaches before the window start; skip the rest of the chunk
                                    break
                                if parsed_end and dt and dt > parsed_end:
                                    continue
                                fwd = getattr(msg, "fwd_from", None)
                                if not fwd:
                                    continue
                                ch_post = getattr(fwd, "channel_post", None)
                                if ch_post and int(ch_post) in parent_ids:
                                    text = msg.message or ""
                                    sent = analyzer.polarity_scores(text).get("compound")
                                    url = f"https://t.me/{channel.lstrip('@')}/{msg.id}" if msg.id else None
                                    async with write_lock:
                                        writer.writerow(
                                            {
                                                "parent_id": int(ch_post),
                                                "id": msg.id,
                                                "date": to_iso(dt) if dt else None,
                                                "message": text,
                                                "sender_id": getattr(msg, "sender_id", None),
                                                "sentiment_compound": sent,
                                                "url": url,
                                            }
                                        )
                                    local_matches += 1
                        break
                    except FloodWaitError as e:
                        secs = int(getattr(e, 'seconds', 5))
                        flood_hits += 1
                        flood_wait_seconds += secs
                        print(f"[rate-limit] FloodWait while scanning ids {min_id}-{max_id}; waiting {secs}s", flush=True)
                        await asyncio.sleep(secs + 1)
                        attempts += 1
                        continue
                    except Exception:
                        # Best-effort; skip this chunk
                        break
                async with progress_lock:
                    matched_total += local_matches
                    completed_chunks += 1
                    print(
                        f"[forwards] chunks {completed_chunks}/{total_chunks}; last {min_id}-{max_id} wrote {local_matches} forwards; total forwards {matched_total}",
                        flush=True,
                    )

            tasks = [asyncio.create_task(process_chunk(i)) for i in range(total_chunks)]
            await asyncio.gather(*tasks)

    await client.disconnect()
    if flood_hits:
        print(f"[rate-limit] Summary: {flood_hits} FloodWait events; total waited ~{flood_wait_seconds}s", flush=True)


def main():
    import argparse

    parser = argparse.ArgumentParser(description="Telegram scraper utilities")
    sub = parser.add_subparsers(dest="command", required=True)

    # Subcommand: scrape channel history
    p_scrape = sub.add_parser("scrape", help="Scrape messages from a channel")
    p_scrape.add_argument("channel", help="Channel username or t.me link, e.g. @python, https://t.me/python")
    p_scrape.add_argument("--output", "-o", required=True, help="Output file (.jsonl or .csv)")
    p_scrape.add_argument("--limit", type=int, default=None, help="Max number of messages to save after filtering")
    p_scrape.add_argument("--offset-date", dest="offset_date", default=None, help="Deprecated: use --start-date instead. ISO date (inclusive)")
    p_scrape.add_argument("--start-date", dest="start_date", default=None, help="ISO start date (inclusive)")
    p_scrape.add_argument("--end-date", dest="end_date", default=None, help="ISO end date (inclusive)")
    p_scrape.add_argument("--append", action="store_true", help="Append to the output file instead of overwriting")
    p_scrape.add_argument("--session-name", default=os.getenv("TELEGRAM_SESSION_NAME", "telegram"))
    p_scrape.add_argument("--phone", default=None)
    p_scrape.add_argument("--twofa-password", default=os.getenv("TELEGRAM_2FA_PASSWORD"))

    # Subcommand: fetch replies for specific message ids
    p_rep = sub.add_parser("replies", help="Fetch replies for given message IDs and save to CSV")
    p_rep.add_argument("channel", help="Channel username or t.me link")
    src = p_rep.add_mutually_exclusive_group(required=True)
    src.add_argument("--ids", help="Comma-separated parent message IDs")
    src.add_argument("--from-csv", dest="from_csv", help="Path to CSV with an 'id' column to use as parent IDs")
    p_rep.add_argument("--output", "-o", required=True, help="Output CSV path (e.g., data/replies_channel.csv)")
    p_rep.add_argument("--append", action="store_true", help="Append to the output file instead of overwriting")
    p_rep.add_argument("--session-name", default=os.getenv("TELEGRAM_SESSION_NAME", "telegram"))
    p_rep.add_argument("--phone", default=None)
    p_rep.add_argument("--twofa-password", default=os.getenv("TELEGRAM_2FA_PASSWORD"))
    p_rep.add_argument("--concurrency", type=int, default=5, help="Number of parent IDs to process in parallel (default 5)")
    p_rep.add_argument("--min-replies", type=int, default=None, help="When using --from-csv, only process parents with replies >= this value")
    p_rep.add_argument("--resume", action="store_true", help="Resume mode: skip parent_id,id pairs already present in the output CSV")

    # Subcommand: fetch forwards (same-channel forwards referencing parent ids)
    p_fwd = sub.add_parser("forwards", help="Best-effort: find forwards within the same channel for given parent IDs")
    p_fwd.add_argument("channel", help="Channel username or t.me link")
    src2 = p_fwd.add_mutually_exclusive_group(required=True)
    src2.add_argument("--ids", help="Comma-separated parent message IDs")
    src2.add_argument("--from-csv", dest="from_csv", help="Path to CSV with an 'id' column to use as parent IDs")
    p_fwd.add_argument("--output", "-o", required=True, help="Output CSV path (e.g., data/forwards_channel.csv)")
    p_fwd.add_argument("--start-date", dest="start_date", default=None)
    p_fwd.add_argument("--end-date", dest="end_date", default=None)
    p_fwd.add_argument("--scan-limit", dest="scan_limit", type=int, default=None, help="Max messages to scan in channel history")
    p_fwd.add_argument("--concurrency", type=int, default=5, help="Number of id chunks to scan in parallel (requires --scan-limit)")
    p_fwd.add_argument("--chunk-size", dest="chunk_size", type=int, default=1000, help="Approx. messages per chunk (ids)")
    p_fwd.add_argument("--append", action="store_true", help="Append to the output file instead of overwriting")
    p_fwd.add_argument("--session-name", default=os.getenv("TELEGRAM_SESSION_NAME", "telegram"))
    p_fwd.add_argument("--phone", default=None)
    p_fwd.add_argument("--twofa-password", default=os.getenv("TELEGRAM_2FA_PASSWORD"))

    args = parser.parse_args()

    # Normalize channel
    channel = getattr(args, "channel", None)
    if channel and channel.startswith("https://t.me/"):
        channel = channel.replace("https://t.me/", "@")

    def _normalize_handle(ch: Optional[str]) -> Optional[str]:
        if not ch:
            return ch
        # Expect inputs like '@name' or 'name'; return lowercase without the leading '@'
        return ch.lstrip('@').lower()

    def _extract_handle_from_url(url: str) -> Optional[str]:
        try:
            if not url:
                return None
            # Accept forms like https://t.me/Name/123 or http(s)://t.me/c/<id>/<msg>.
            # Only public usernames (not /c/ links) can be compared reliably.
            if "/t.me/" in url:
                # Crude parse without urlparse to avoid a dependency
                after = url.split("t.me/")[-1]
                parts = after.split('/')
                if parts and parts[0] and parts[0] != 'c':
                    return parts[0]
        except Exception:
            return None
        return None

    if args.command == "scrape":
        written = asyncio.run(
            scrape_channel(
                channel=channel,
                output=args.output,
                limit=args.limit,
                offset_date=args.offset_date,
                start_date=args.start_date,
                end_date=args.end_date,
                append=getattr(args, "append", False),
                session_name=args.session_name,
                phone=args.phone,
                twofa_password=args.twofa_password,
            )
        )
        print(f"Wrote {written} messages to {args.output}")
    elif args.command == "replies":
        # If using --from-csv, try to infer the channel from URLs and warn on mismatch
        try:
            if getattr(args, 'from_csv', None):
                import pandas as _pd  # local import to keep startup light

                # Read a small sample of the URL column to detect the handle
                sample = _pd.read_csv(args.from_csv, usecols=['url'], nrows=20)
                url_handles = [
                    _extract_handle_from_url(str(u)) for u in sample['url'].dropna().tolist() if isinstance(u, str)
                ]
                inferred = next((h for h in url_handles if h), None)
                provided = _normalize_handle(channel)
                if inferred and provided and _normalize_handle(inferred) != provided:
                    print(
                        f"[warning] CSV appears to be from @{_normalize_handle(inferred)} but you passed @{provided}. "
                        f"Replies may be empty. Consider using https://t.me/{inferred}",
                        flush=True,
                    )
        except Exception:
            # Best-effort only; ignore any issues reading/inspecting the CSV
            pass
        parent_ids: Set[int]
        if getattr(args, "ids", None):
            parent_ids = {int(x.strip()) for x in args.ids.split(",") if x.strip()}
        else:
            import pandas as pd  # local import

            usecols = ['id']
            if args.min_replies is not None:
                usecols.append('replies')
            df = pd.read_csv(args.from_csv, usecols=usecols)
            if args.min_replies is not None and 'replies' in df.columns:
                df = df[df['replies'].fillna(0).astype(int) >= int(args.min_replies)]
            parent_ids = set(int(x) for x in df['id'].dropna().astype(int).tolist())
        existing_pairs = None
        if args.resume and os.path.exists(args.output):
            try:
                import csv as _csv

                existing_pairs = set()
                with open(args.output, "r", encoding="utf-8") as _f:
                    reader = _csv.DictReader(_f)
                    for row in reader:
                        try:
                            existing_pairs.add((int(row.get("parent_id")), int(row.get("id"))))
                        except Exception:
                            continue
            except Exception:
                existing_pairs = None

        asyncio.run(
            fetch_replies(
                channel=channel,
                parent_ids=sorted(parent_ids),
                output_csv=args.output,
                append=getattr(args, "append", False),
                session_name=args.session_name,
                phone=args.phone,
                twofa_password=args.twofa_password,
                concurrency=max(1, int(getattr(args, 'concurrency', 5))),
                existing_pairs=existing_pairs,
            )
        )
        print(f"Saved replies to {args.output}")
    elif args.command == "forwards":
        parent_ids: Set[int]
        if getattr(args, "ids", None):
            parent_ids = {int(x.strip()) for x in args.ids.split(",") if x.strip()}
        else:
            import pandas as pd

            df = pd.read_csv(args.from_csv)
            parent_ids = set(int(x) for x in df['id'].dropna().astype(int).tolist())
        asyncio.run(
            fetch_forwards(
                channel=channel,
                parent_ids=parent_ids,
                output_csv=args.output,
                start_date=args.start_date,
                end_date=args.end_date,
                scan_limit=args.scan_limit,
                concurrency=max(1, int(getattr(args, 'concurrency', 5))),
                chunk_size=max(1, int(getattr(args, 'chunk_size', 1000))),
                append=getattr(args, "append", False),
                session_name=args.session_name,
                phone=args.phone,
                twofa_password=args.twofa_password,
            )
        )
        print(f"Saved forwards to {args.output}")


if __name__ == "__main__":
    main()
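The scraper's date-window logic above relies on Telethon yielding messages newest-first: anything newer than the end date is skipped, and the first message older than the start date ends the scan entirely. A minimal sketch of that filter on plain datetimes (the function name here is illustrative, not part of the module):

```python
from datetime import datetime


def filter_newest_first(dates, start=None, end=None):
    """Keep datetimes inside [start, end], assuming `dates` is sorted newest-first."""
    kept = []
    for dt in dates:
        if start and dt < start:
            break      # older than the window start: every later item is older still, so stop
        if end and dt > end:
            continue   # newer than the window end: skip, but keep scanning toward older items
        kept.append(dt)
    return kept


days = [datetime(2024, 1, d) for d in (9, 7, 5, 3, 1)]  # newest-first, like Telethon's iterator
window = filter_newest_first(days, start=datetime(2024, 1, 3), end=datetime(2024, 1, 7))
# window holds Jan 7, Jan 5, and Jan 3
```

The early `break` is what makes scraping a narrow window of a huge channel cheap: the loop never visits history older than the requested start.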
135
src/train_sentiment.py
Normal file
@@ -0,0 +1,135 @@
|
|||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
from datasets import Dataset, ClassLabel
|
||||||
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
|
||||||
|
import inspect
|
||||||
|
import numpy as np
|
||||||
|
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
|
||||||
|
|
||||||
|
|
||||||
|
def build_dataset(df: pd.DataFrame, text_col: str, label_col: str, label_mapping: Optional[dict] = None) -> Dataset:
|
||||||
|
d = df[[text_col, label_col]].dropna().copy()
|
||||||
|
# Normalize and drop empty labels
|
||||||
|
d[label_col] = d[label_col].astype(str).str.strip()
|
||||||
|
d = d[d[label_col] != '']
|
||||||
|
if d.empty:
|
||||||
|
raise SystemExit("No labeled rows found. Please fill the 'label' column in your CSV (e.g., neg/neu/pos or 0/1/2).")
|
||||||
|
if label_mapping:
|
||||||
|
d[label_col] = d[label_col].map(label_mapping)
|
||||||
|
# If labels are strings, factorize them
|
||||||
|
if d[label_col].dtype == object:
|
||||||
|
d[label_col] = d[label_col].astype('category')
|
||||||
|
label2id = {k: int(v) for v, k in enumerate(d[label_col].cat.categories)}
|
||||||
|
id2label = {v: k for k, v in label2id.items()}
|
||||||
|
d[label_col] = d[label_col].cat.codes
|
||||||
|
else:
|
||||||
|
# Assume numeric 0..N-1
|
||||||
|
classes = sorted(d[label_col].unique().tolist())
|
||||||
|
label2id = {str(c): int(c) for c in classes}
|
||||||
|
id2label = {int(c): str(c) for c in classes}
|
||||||
|
hf = Dataset.from_pandas(d.reset_index(drop=True))
|
||||||
|
hf = hf.class_encode_column(label_col)
|
||||||
|
hf.features[label_col] = ClassLabel(num_classes=len(id2label), names=[id2label[i] for i in range(len(id2label))])
|
||||||
|
return hf, label2id, id2label
|
||||||
|
|
||||||
|
|
||||||
|
def tokenize_fn(examples, tokenizer, text_col):
    return tokenizer(examples[text_col], truncation=True, padding=False)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'precision_macro': precision_score(labels, preds, average='macro', zero_division=0),
        'recall_macro': recall_score(labels, preds, average='macro', zero_division=0),
        'f1_macro': f1_score(labels, preds, average='macro', zero_division=0),
    }


def main():
    parser = argparse.ArgumentParser(description='Fine-tune a transformers model for sentiment.')
    parser.add_argument('--train-csv', required=True, help='Path to labeled CSV')
    parser.add_argument('--text-col', default='message', help='Text column name')
    parser.add_argument('--label-col', default='label', help='Label column name (e.g., pos/neu/neg or 2/1/0)')
    parser.add_argument('--model-name', default='distilbert-base-uncased', help='Base model name or path')
    parser.add_argument('--output-dir', default='models/sentiment-distilbert', help='Where to save the fine-tuned model')
    parser.add_argument('--epochs', type=int, default=3)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--lr', type=float, default=5e-5)
    parser.add_argument('--eval-split', type=float, default=0.1, help='Fraction of data for eval')
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    df = pd.read_csv(args.train_csv)
    ds, label2id, id2label = build_dataset(df, args.text_col, args.label_col)
    if args.eval_split > 0:
        ds = ds.train_test_split(test_size=args.eval_split, seed=42, stratify_by_column=args.label_col)
        train_ds, eval_ds = ds['train'], ds['test']
    else:
        train_ds, eval_ds = ds, None

    num_labels = len(id2label)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name,
        num_labels=num_labels,
        id2label=id2label,
        label2id={k: int(v) for k, v in label2id.items()},
    )

    tokenized_train = train_ds.map(lambda x: tokenize_fn(x, tokenizer, args.text_col), batched=True)
    tokenized_eval = eval_ds.map(lambda x: tokenize_fn(x, tokenizer, args.text_col), batched=True) if eval_ds is not None else None

    # Build TrainingArguments with compatibility across transformers versions
    base_kwargs = {
        'output_dir': args.output_dir,
        'per_device_train_batch_size': args.batch_size,
        'per_device_eval_batch_size': args.batch_size,
        'num_train_epochs': args.epochs,
        'learning_rate': args.lr,
        'fp16': False,
        'logging_steps': 50,
    }
    eval_kwargs = {}
    if tokenized_eval is not None:
        # Set both evaluation_strategy and eval_strategy for compatibility across transformers versions
        eval_kwargs.update({
            'evaluation_strategy': 'epoch',
            'eval_strategy': 'epoch',
            'save_strategy': 'epoch',
            'load_best_model_at_end': True,
            'metric_for_best_model': 'f1_macro',
            'greater_is_better': True,
        })

    # Filter kwargs to only include parameters supported by this transformers version
    sig = inspect.signature(TrainingArguments.__init__)
    allowed = set(sig.parameters.keys())

    def _filter(d: dict) -> dict:
        return {k: v for k, v in d.items() if k in allowed}

    training_args = TrainingArguments(**_filter(base_kwargs), **_filter(eval_kwargs))

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics if tokenized_eval is not None else None,
    )

    trainer.train()
    trainer.save_model(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)
    print(f"Model saved to {args.output_dir}")


if __name__ == '__main__':
    main()
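The signature-filtering trick used above to keep `TrainingArguments` working across transformers versions can be sketched standalone; `make_args` here is a hypothetical stand-in for any constructor whose accepted keywords vary between library versions:

```python
import inspect

def filter_kwargs(fn, kwargs):
    # Keep only the keyword arguments that fn's signature actually accepts.
    allowed = set(inspect.signature(fn).parameters)
    return {k: v for k, v in kwargs.items() if k in allowed}

def make_args(output_dir, logging_steps=50):  # stand-in for TrainingArguments
    return (output_dir, logging_steps)

# 'eval_strategy' is silently dropped because make_args does not accept it.
cfg = {'output_dir': 'models/x', 'logging_steps': 10, 'eval_strategy': 'epoch'}
print(make_args(**filter_kwargs(make_args, cfg)))  # ('models/x', 10)
```

This trades strictness for portability: an unsupported option is ignored rather than raising a `TypeError`.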
90
src/transformer_sentiment.py
Normal file
@@ -0,0 +1,90 @@
from typing import List, Optional

import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


class TransformerSentiment:
    def __init__(self, model_name_or_path: str, device: Optional[str] = None, max_length: int = 256):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
        self.max_length = max_length
        if device is None:
            if torch.cuda.is_available():
                device = 'cuda'
            elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                device = 'mps'
            else:
                device = 'cpu'
        self.device = device
        self.model.to(self.device)
        self.model.eval()

        # Expect labels roughly like {0: 'neg', 1: 'neu', 2: 'pos'} or similar
        self.id2label = self.model.config.id2label if hasattr(self.model.config, 'id2label') else {0: '0', 1: '1', 2: '2'}

    def _compound_from_probs(self, probs: np.ndarray) -> float:
        # Map class probabilities to a [-1, 1] compound-like score.
        # If there are exactly 3 labels named roughly neg/neu/pos (any case), use that mapping.
        labels = [self.id2label.get(i, str(i)).lower() for i in range(len(probs))]
        try:
            neg_idx = labels.index('neg') if 'neg' in labels else labels.index('negative')
        except ValueError:
            neg_idx = 0
        try:
            # Looked up for completeness; the skew below does not use the neutral mass.
            neu_idx = labels.index('neu') if 'neu' in labels else labels.index('neutral')
        except ValueError:
            neu_idx = 1 if len(probs) > 2 else None
        try:
            pos_idx = labels.index('pos') if 'pos' in labels else labels.index('positive')
        except ValueError:
            pos_idx = len(probs) - 1

        p_neg = float(probs[neg_idx]) if neg_idx is not None else 0.0
        p_pos = float(probs[pos_idx]) if pos_idx is not None else 0.0
        # A simple skew: pos - neg, clamped to [-1, 1]
        comp = max(-1.0, min(1.0, p_pos - p_neg))
        return comp

    @torch.no_grad()
    def predict_compound_batch(self, texts: List[str], batch_size: int = 32) -> List[float]:
        out: List[float] = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            enc = self.tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=self.max_length,
                return_tensors='pt'
            )
            enc = {k: v.to(self.device) for k, v in enc.items()}
            logits = self.model(**enc).logits
            probs = torch.softmax(logits, dim=-1).cpu().numpy()
            for row in probs:
                out.append(self._compound_from_probs(row))
        return out

    @torch.no_grad()
    def predict_probs_and_labels(self, texts: List[str], batch_size: int = 32):
        probs_all = []
        labels_all: List[str] = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            enc = self.tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=self.max_length,
                return_tensors='pt'
            )
            enc = {k: v.to(self.device) for k, v in enc.items()}
            logits = self.model(**enc).logits
            probs = torch.softmax(logits, dim=-1).cpu().numpy()
            preds = probs.argmax(axis=-1)
            for j, row in enumerate(probs):
                probs_all.append(row)
                label = self.id2label.get(int(preds[j]), str(int(preds[j])))
                labels_all.append(label)
        return probs_all, labels_all
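The pos-minus-neg "compound" skew that `_compound_from_probs` computes can be illustrated standalone; the probabilities below are hypothetical model outputs for an assumed neg/neu/pos label order:

```python
# Standalone sketch of the compound skew used by TransformerSentiment:
# positive probability mass minus negative mass, clamped to [-1, 1].
def compound_from_probs(probs, labels=('neg', 'neu', 'pos')):
    p = dict(zip(labels, probs))
    # Neutral mass is ignored; only the pos/neg imbalance matters.
    return max(-1.0, min(1.0, p.get('pos', 0.0) - p.get('neg', 0.0)))

print(round(compound_from_probs([0.7, 0.2, 0.1]), 2))  # -0.6 (mostly negative)
print(round(compound_from_probs([0.1, 0.2, 0.7]), 2))  # 0.6 (mostly positive)
```

A text split evenly between positive and negative thus scores near 0, the same neighborhood as a confidently neutral one, which is the usual caveat of collapsing class probabilities to a single VADER-style scalar.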