docs/COMMANDS.md

# Project command reference

This file lists all supported commands and practical permutations for `./run_scraper.sh`, with short comments and tips. It mirrors the actual CLI flags in the code.

- Shell: zsh (macOS) — commands below are ready to paste.
- Env: A `.venv` is created automatically, and dependencies are installed from `requirements.txt`.
- Secrets: Create `.env` with TELEGRAM_API_ID and TELEGRAM_API_HASH; for fixtures also set FOOTBALL_DATA_API_TOKEN.
- 2FA: If you use Telegram two-step verification, set TELEGRAM_2FA_PASSWORD in `.env` (the shell wrapper doesn’t accept a flag for this).
- Sessions: Telethon uses a SQLite session file (default `telegram.session`). When running multiple tools in parallel, use distinct `--session-name` values.

## Common conventions

- Channels
  - Use either handle or URL: `-c @name` or `-c https://t.me/name`.
  - For replies, the channel must match the posts’ source in your CSV `url` column.
- Output behavior
  - scrape/replies/forwards overwrite unless you pass `--append`.
  - analyze always overwrites its outputs.
- Rate-limits
  - Replies/forwards log `[rate-limit]` if Telegram asks you to wait. Reduce `--concurrency` if frequent.
- Parallel runs
  - Add `--session-name <unique>` per process to avoid “database is locked”. Prefer sessions outside iCloud Drive.
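
The channel-matching rule can be checked before a long run. The sketch below is a hypothetical helper, not part of `run_scraper.sh`: it reduces `@name`, `https://t.me/name`, or a post URL such as `https://t.me/name/123` to a bare lowercase handle and compares two references. It assumes handles are case-insensitive and that CSV URLs follow the `https://t.me/<handle>/<id>` shape.

```zsh
# Hypothetical helper: reduce a channel reference to a bare lowercase handle.
channel_handle() {
  local ref="$1"
  ref="${ref#https://t.me/}"   # drop the URL prefix if present
  ref="${ref#http://t.me/}"
  ref="${ref#@}"               # drop a leading @ if present
  ref="${ref%%/*}"             # drop a trailing slash or /<post-id> suffix
  printf '%s\n' "$ref" | tr '[:upper:]' '[:lower:]'
}

# Succeeds when both references point at the same handle.
same_channel() {
  [ "$(channel_handle "$1")" = "$(channel_handle "$2")" ]
}
```

For example, `same_channel "$CH" "https://t.me/SourceChannel/123"` could guard a replies run when `CH` is set to the intended channel.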

---

## Scrape (posts/messages)

Minimal (overwrite output):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv
```

With date range and limit:
```zsh
./run_scraper.sh scrape \
  -c https://t.me/SomeChannel \
  -o data/messages.jsonl \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --limit 500
```

Legacy offset date (deprecated; prefer --start-date):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --offset-date 2025-01-01
```

Append to existing file and pass phone on first login:
```zsh
./run_scraper.sh scrape \
  -c @SomeChannel \
  -o data/messages.csv \
  --append \
  --phone +15551234567
```

Use a custom session (useful in parallel):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --session-name telegram_scrape
```

Notes:
- Output format inferred by extension: `.csv` or `.jsonl`/`.ndjson`.
- Two-step verification: set TELEGRAM_2FA_PASSWORD in `.env` (no CLI flag in the shell wrapper).
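
As a pre-flight check, the extension rule above can be mirrored in a small function. This is only a sketch: `output_format` is a hypothetical name, and rejecting other extensions is a choice made here, since this reference does not say how the tool treats them.

```zsh
# Hypothetical pre-flight check mirroring the extension rule above.
output_format() {
  case "$1" in
    *.csv)            echo csv ;;
    *.jsonl|*.ndjson) echo jsonl ;;
    *)                echo "unsupported output extension: $1" >&2; return 1 ;;
  esac
}
```

Calling `output_format data/messages.jsonl` before a scrape catches a mistyped extension early.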

### All valid forms (scrape)

Use one of the following combinations. Replace placeholders with your values.

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv or .jsonl
  - Optional value flags: [--limit N] [--session-name NAME] [--phone NUMBER]

- Date filter permutations (4) × Append flag (2) × Limit presence (2) = 16 forms

1) No dates, no append, no limit
  ./run_scraper.sh scrape -c CH -o OUT
2) No dates, no append, with limit
  ./run_scraper.sh scrape -c CH -o OUT --limit N
3) No dates, with append, no limit
  ./run_scraper.sh scrape -c CH -o OUT --append
4) No dates, with append, with limit
  ./run_scraper.sh scrape -c CH -o OUT --append --limit N
5) Start only, no append, no limit
  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD
6) Start only, no append, with limit
  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --limit N
7) Start only, with append, no limit
  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append
8) Start only, with append, with limit
  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append --limit N
9) End only, no append, no limit
  ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD
10) End only, no append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --limit N
11) End only, with append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append
12) End only, with append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append --limit N
13) Start and end, no append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD
14) Start and end, no append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --limit N
15) Start and end, with append, no limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append
16) Start and end, with append, with limit
   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append --limit N

Optional add-ons valid for any form above:
- Append [--session-name NAME] and/or [--phone NUMBER]
- Deprecated alternative to start-date: add [--offset-date YYYY-MM-DD]

---

## Replies (fetch replies to posts)

From a posts CSV (fast path; skip posts with 0 replies in CSV):
```zsh
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append
```

Using explicit message IDs:
```zsh
./run_scraper.sh replies \
  -c @SourceChannel \
  --ids "123,456,789" \
  -o data/replies.csv \
  --concurrency 5 \
  --append
```

IDs from a file (one per line) using command substitution:
```zsh
IDS=$(tr '\n' ',' < parent_ids.txt | sed 's/,$//')
./run_scraper.sh replies -c @SourceChannel --ids "$IDS" -o data/replies.csv --concurrency 8 --append
```
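
The one-liner above leaves empty entries (`,,`) if the file contains blank lines. A slightly more defensive variant, sketched here under the assumption that IDs are plain integers, strips spaces and Windows line endings and keeps only numeric lines; `ids_from_file` is a hypothetical name.

```zsh
# Strip spaces, tabs, and carriage returns; keep numeric lines; join with commas.
ids_from_file() {
  tr -d ' \t\r' < "$1" | grep -E '^[0-9]+$' | paste -sd, -
}
```

It can be used as `--ids "$(ids_from_file parent_ids.txt)"` in place of the `IDS` variable.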

Parallel-safe session name:
```zsh
./run_scraper.sh replies -c @SourceChannel --from-csv data/messages.csv -o data/replies.csv --concurrency 12 --resume --append --session-name telegram_replies
```

What the flags do:
- `--from-csv PATH` reads parent IDs from a CSV with an `id` column (optionally filtered by `--min-replies`).
- `--ids` provides a comma-separated list of parent IDs.
- `--concurrency K` processes K parent IDs in parallel (default 5).
- `--resume` dedupes by `(parent_id,id)` pairs already present in the output.
- `--append` appends to output instead of overwriting.

Notes:
- The channel (`-c`) must match the posts’ source in your CSV URLs (the tool warns on mismatch).
- First login may require `--phone` (interactive prompt). For 2FA, set TELEGRAM_2FA_PASSWORD in `.env`.

### All valid forms (replies)

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
  - Source: exactly one of S1 or S2
    - S1: --ids "id1,id2,..."
    - S2: --from-csv PATH [--min-replies N]
  - Optional: [--concurrency K] [--session-name NAME] [--phone NUMBER]
  - Binary: [--append], [--resume]

- Enumerated binary permutations for each source (4 per source = 8 total):

S1 + no append + no resume
  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT
S1 + no append + resume
  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --resume
S1 + append + no resume
  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append
S1 + append + resume
  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append --resume

S2 + no append + no resume
  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT
S2 + no append + resume
  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --resume
S2 + append + no resume
  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append
S2 + append + resume
  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append --resume

Optional add-ons valid for any form above:
- Add [--concurrency K] to tune speed; recommended 8–20
- With S2 you may add [--min-replies N] to prioritize parents with replies
- Add [--session-name NAME] and/or [--phone NUMBER]

---

## Forwards (same-channel forwards referencing posts)

Typical concurrent scan (best-effort; often zero results):
```zsh
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500
```

With date filters (applied to scanned messages):
```zsh
./run_scraper.sh forwards \
  -c @SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --scan-limit 10000 \
  --concurrency 8 \
  --chunk-size 1000
```

Using explicit message IDs:
```zsh
./run_scraper.sh forwards -c @SourceChannel --ids "100,200,300" -o data/forwards.csv --scan-limit 8000 --concurrency 6 --chunk-size 1000
```

Sequential mode (no chunking) by omitting --scan-limit:
```zsh
./run_scraper.sh forwards -c @SourceChannel --from-csv data/messages.csv -o data/forwards.csv
```

What the flags do:
- `--scan-limit N`: enables chunked, concurrent scanning of ~N recent message IDs.
- `--concurrency K`: number of id-chunks to scan in parallel (requires `--scan-limit`).
- `--chunk-size M`: approx. IDs per chunk (trade-off between balance/overhead). Start with 1000–2000.
- `--append`: append instead of overwrite.

Notes:
- This only finds forwards within the same channel that reference your parent IDs (self-forwards). Many channels will yield zero.
- Global cross-channel forward discovery is not supported here (can be added as a separate mode).
- Without `--scan-limit`, the tool scans sequentially from newest backwards and logs progress every ~1000 messages.
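
To get a feel for how `--scan-limit` and `--chunk-size` interact, the number of chunks can be estimated with ceiling division. This is a sketch that assumes chunks are contiguous ID ranges of roughly `--chunk-size` IDs each; the tool's actual partitioning may differ, and `chunk_count` is a hypothetical name.

```zsh
# Assumption: chunks are contiguous ID ranges of about --chunk-size IDs each.
chunk_count() {
  local scan_limit="$1" chunk_size="$2"
  echo $(( (scan_limit + chunk_size - 1) / chunk_size ))   # ceiling division
}
```

With the values from the first example, `chunk_count 20000 1500` gives 14, so a `--concurrency` above 14 would presumably bring no further speedup under this assumption.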

### All valid forms (forwards)

- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
  - Source: exactly one of S1 or S2
    - S1: --ids "id1,id2,..."
    - S2: --from-csv PATH
  - Modes:
    - M1: Sequential scan (omit --scan-limit)
    - M2: Chunked concurrent scan (requires --scan-limit N; accepts --concurrency K and --chunk-size M)
  - Optional date filters for both modes: [--start-date D] [--end-date D]
  - Binary: [--append]
  - Optional: [--session-name NAME] [--phone NUMBER]

- Enumerated permutations by mode, source, and append (2 modes × 2 sources × 2 append = 8 forms):

M1 + S1 + no append
  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT [--start-date D] [--end-date D]
M1 + S1 + append
  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --append [--start-date D] [--end-date D]
M1 + S2 + no append
  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT [--start-date D] [--end-date D]
M1 + S2 + append
  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --append [--start-date D] [--end-date D]

M2 + S1 + no append
  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S1 + append
  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + no append
  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + append
  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]

Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]

---

## Analyze (reports and tagging)

Posts-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv
```
Outputs:
- `data/messages_report.md`
- `data/messages_tagged.csv`
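
Because analyze overwrites its outputs, it can help to know the paths in advance. The sketch below derives the default names from the input path, following the pattern shown in the two Outputs lists in this section; `analyze_outputs` is a hypothetical helper, and it does not account for `-o`.

```zsh
# Derive default analyze output paths from the input CSV path.
analyze_outputs() {
  local stem="${1%.csv}"
  printf '%s\n' "${stem}_report.md" "${stem}_tagged.csv"
}
```

The printed paths can then be copied aside before re-running analyze if an earlier report needs to be kept.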

Replies-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/replies.csv \
  --channel "Replies - @SourceChannel" \
  --tags-config config/tags.yaml \
  --write-augmented-csv
```
Outputs:
- `data/replies_report.md`
- `data/replies_tagged.csv`

Combined (posts report augmented with replies):
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --replies-csv data/replies.csv \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv \
  --write-combined-csv \
  --emoji-mode keep \
  --emoji-boost \
  --save-plots
```
Adds to posts dataset:
- `sentiment_compound` for posts (VADER)
- `replies_sentiment_mean` (avg reply sentiment per post)
- `replies_count_scraped` and `replies_top_tags` (rollup from replies)

Report sections include:
- Summary, top posts by views/forwards/replies
- Temporal distributions
- Per-tag engagement
- Per-tag sentiment (posts)
- Replies per-tag summary
- Per-tag sentiment (replies)
- Combined sentiment (posts + replies)
- Matchday cross-analysis (when `--fixtures-csv` is provided):
  - Posts: on vs off matchdays (counts and sentiment shares)
  - Posts engagement vs matchday (replies per post: total, mean, median, share of posts with replies)
  - Replies: on vs off matchdays (counts and sentiment shares)
  - Replies by parent matchday and by reply date are both shown; parent-based classification is recommended for engagement.

Notes:
- Analyze overwrites outputs; use `-o` to customize report filename if needed.
- Emoji handling: add `--emoji-mode keep|demojize|strip` (default keep). Optionally `--emoji-boost` to gently tilt scores when clearly positive/negative emojis are present.
- Add `--write-combined-csv` to emit a unified CSV of posts+replies with a `content_type` column.

### All valid forms (analyze)

- Base variables:
  - IN = input CSV (posts or replies)
  - Optional outputs/labels: [-o REPORT.md] [--channel @handle]
  - Optional configs/data: [--tags-config config/tags.yaml] [--replies-csv REPLIES.csv] [--fixtures-csv FIXTURES.csv]
  - Binary: [--write-augmented-csv]

- Core permutations across replies-csv, fixtures-csv, write-augmented-csv (2×2×2 = 8 forms):

1) No replies, no fixtures, no aug
  ./run_scraper.sh analyze -i IN
2) No replies, no fixtures, with aug
  ./run_scraper.sh analyze -i IN --write-augmented-csv
3) No replies, with fixtures, no aug
  ./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv
4) No replies, with fixtures, with aug
  ./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv --write-augmented-csv
5) With replies, no fixtures, no aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv
6) With replies, no fixtures, with aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --write-augmented-csv
7) With replies, with fixtures, no aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv
8) With replies, with fixtures, with aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv --write-augmented-csv

Optional add-ons valid for any form above:
- Append [-o REPORT.md] to control output filename
- Append [--channel @handle] for title
- Append [--tags-config config/tags.yaml] to enable tagging and per-tag summaries
- Append [--emoji-mode keep|demojize|strip] and optionally [--emoji-boost]
- Append [--write-combined-csv] to produce a merged posts+replies CSV
- Append [--save-plots] to emit plots to the data folder
- Append [--sentiment-backend transformers] and [--transformers-model <name-or-path>] to use a local HF model instead of VADER
- Append [--export-transformers-details] to include `sentiment_label` and `sentiment_probs` in augmented/combined CSVs
- Append [--sentiment-backend gpt] and optionally [--gpt-model MODEL] [--gpt-base-url URL] [--gpt-batch-size K] to use a local GPT (Ollama) backend
- Plot sizing and label controls (daily charts):
  - [--plot-width-scale FLOAT] [--plot-max-width INCHES] [--plot-height INCHES]
  - [--activity-top-n N]
  - [--labels-max-per-day N] [--labels-per-line N] [--labels-band-y FLOAT] [--labels-stagger-rows N] [--labels-annotate-mode ticks|all|ticks+top]

When fixtures are provided (`--fixtures-csv`):
- The report adds a "## Matchday cross-analysis" section with on vs off matchday tables.
- Plots include:
  - daily_activity_stacked.png with match labels inside the chart
  - daily_volume_and_sentiment.png (bars: volume; lines: pos%/neg%)
  - matchday_sentiment_overall.png (time series on fixture days)
  - matchday_posts_volume_vs_sentiment.png (scatter)
- The combined CSV (with `--write-combined-csv`) includes `is_matchday` and, for replies, `parent_is_matchday` when available.
- Replies are classified two ways: by reply date (`is_matchday` on the reply row) and by their parent post (`parent_is_matchday`). The latter better reflects matchday-driven engagement.

Emoji and plots examples:
```zsh
# Keep emojis (default) and boost for strong positive/negative emojis
./run_scraper.sh analyze -i data/messages.csv --emoji-mode keep --emoji-boost --save-plots

# Demojize to :smiling_face: tokens (helps some tokenizers), with boost
./run_scraper.sh analyze -i data/messages.csv --emoji-mode demojize --emoji-boost

# Strip emojis entirely (if they add noise)
./run_scraper.sh analyze -i data/messages.csv --emoji-mode strip --save-plots

# Use a transformers model for sentiment (will auto-download on first use unless a local path is provided).
# Tip: for an off-the-shelf sentiment head, try a fine-tuned model like SST-2:
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english
```

## Local GPT backend (Ollama)

Use a local GPT model that returns JSON {label, confidence} per message; the analyzer maps this to a compound score and falls back to VADER on errors.

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

---

## Train a local transformers sentiment model

Prepare a labeled CSV with at least two columns: `message` and `label` (e.g., neg/neu/pos or 0/1/2).

Don’t have one yet? Create a labeling set from your existing posts/replies:

```zsh
# Generate a CSV to annotate by hand (adds a blank 'label' column)
./.venv/bin/python -m src.make_labeling_set \
  --posts-csv data/premier_league_update.csv \
  --replies-csv data/premier_league_replies.csv \
  --sample-size 1000 \
  -o data/labeled_sentiment.csv

# Or via alias (after sourcing scripts/aliases.zsh)
make_label_set "$POSTS_CSV" "$REPLIES_CSV" data/labeled_sentiment.csv 1000
```

Then fine-tune:

```zsh
# Ensure the venv exists (run any ./run_scraper.sh command once), then:
./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
```

Use it in analyze:

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert
```

Export details (labels, probabilities) into CSVs:

```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv
```

Notes:
- The analyzer maps model class probabilities to a VADER-like compound score in [-1, 1] for compatibility with the rest of the report.
- If the model has id2label including 'neg','neu','pos' labels, the mapping is more accurate; otherwise it defaults to pos - neg.
- GPU/Apple Silicon (MPS) will be used automatically if available.
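
The fallback rule mentioned above (pos - neg) is simple enough to check by hand. The sketch below computes it for one message from three class probabilities; it illustrates the arithmetic only and is not the analyzer's code. `compound_from_probs` is a hypothetical name.

```zsh
# Fallback rule from the note above: compound = P(pos) - P(neg).
compound_from_probs() {
  # Arguments: P(neg) P(neu) P(pos); the neutral mass is accepted but unused.
  awk -v neg="$1" -v pos="$3" 'BEGIN { printf "%.2f\n", pos - neg }'
}
```

For probabilities 0.10, 0.20, 0.70 this yields 0.60. Because the probabilities sum to 1, the result always stays within [-1, 1].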

Torch install note (macOS):
- `requirements.txt` uses conditional pins: `torch==2.3.1` for Python < 3.13 and `torch>=2.7.1` for Python ≥ 3.13. This keeps installs smooth on macOS. If you hit install issues, let us know.

## Evaluate a fine-tuned model

```zsh
./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
```
Prints accuracy, macro-precision/recall/F1, and a classification report.

## Fixtures (Premier League schedule via football-data.org)

Fetch fixtures between dates:
```zsh
./run_scraper.sh fixtures \
  --start-date 2025-08-15 \
  --end-date 2025-10-15 \
  -o data/fixtures.csv
```

Notes:
- Requires `FOOTBALL_DATA_API_TOKEN` in `.env`.
- Output may be `.csv` or `.json` (by extension).

### All valid forms (fixtures)

- Base variables:
  - SD = start date YYYY-MM-DD
  - ED = end date YYYY-MM-DD
  - OUT = output .csv or .json

Form:
  ./run_scraper.sh fixtures --start-date SD --end-date ED -o OUT
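
Since both dates are required, a quick shape-and-order check before calling the API can save a request. This is a minimal sketch with hypothetical helper names; it checks only the `YYYY-MM-DD` shape and that SD is not after ED, not calendar validity.

```zsh
# Succeeds when the argument has the YYYY-MM-DD shape (digits only).
is_iso_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when both dates are well-formed and the first is not after the second.
valid_date_range() {
  is_iso_date "$1" && is_iso_date "$2" || return 1
  # ISO dates order like integers once the hyphens are removed.
  [ "$(printf '%s' "$1" | tr -d '-')" -le "$(printf '%s' "$2" | tr -d '-')" ]
}
```

For example: `valid_date_range "$SD" "$ED" && ./run_scraper.sh fixtures --start-date "$SD" --end-date "$ED" -o "$OUT"`.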

---

## Advanced recipes

Parallel replies + forwards with separate sessions:
```zsh
# Terminal 1 – replies
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_replies"

# Terminal 2 – forwards
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500 \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_forwards"
```

Tuning for rate limits:
- If `[rate-limit]` logs are frequent, reduce `--concurrency` (by 3 to 5 as a first step) and keep `--chunk-size` around 1000–2000.
- For replies, prioritize with `--min-replies 1` to avoid parents with zero replies.
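
One way to make "frequent" concrete is to count `[rate-limit]` lines in a captured log. The sketch below assumes the run's output was saved to a file (for example with `2>&1 | tee run.log`); the threshold of 10 hits and the step of 4 are arbitrary illustration values, and `suggest_concurrency` is a hypothetical name.

```zsh
# Suggest a lower --concurrency when a log shows many [rate-limit] lines.
# Arguments: log file, current concurrency, optional hit threshold (default 10).
suggest_concurrency() {
  local log="$1" current="$2" threshold="${3:-10}" hits next
  hits=$(grep -c '\[rate-limit\]' "$log" 2>/dev/null || true)
  hits=${hits:-0}
  next="$current"
  if [ "$hits" -gt "$threshold" ]; then
    next=$(( current - 4 ))              # step down, but never below 1
    if [ "$next" -lt 1 ]; then next=1; fi
  fi
  echo "$next"
}
```

After a run logged to `run.log` with `--concurrency 15`, `suggest_concurrency run.log 15` prints either 15 or 11, depending on how many waits were logged.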

Safety:
- For replies, combine `--append` (so the output is not truncated) with `--resume` (so rows already present are not written again).
- Forwards and scrape don’t dedupe; prefer writing to a new file or dedupe after.
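
For the "write to a new file" route, a timestamped output name avoids clobbering earlier runs. This is a sketch with a hypothetical helper name; the timestamp format is an arbitrary choice.

```zsh
# Insert a timestamp before the extension: data/forwards.csv -> data/forwards_<stamp>.csv
stamped_output() {
  local file="$1" stamp
  stamp=$(date +%Y%m%d_%H%M%S)
  case "${file##*/}" in
    *.*) printf '%s_%s.%s\n' "${file%.*}" "$stamp" "${file##*.}" ;;
    *)   printf '%s_%s\n' "$file" "$stamp" ;;   # no extension in the file name
  esac
}
```

It can be passed straight to `-o`, for example `-o "$(stamped_output data/forwards.csv)"`.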

---

## Environment setup quick-start

Create `.env` (script will prompt if missing):
```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional defaults
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```

First run will prompt for phone and code (and 2FA if enabled).
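
Before the first run, it can be useful to confirm that the required keys are actually set. The sketch below lists keys that are absent or empty in a given file; it assumes plain `KEY=value` lines as in the template above, and `missing_env_keys` is a hypothetical helper, not something the wrapper provides.

```zsh
# Print each requested key that has no non-empty, uncommented assignment in the file.
# Usage: missing_env_keys FILE KEY...
missing_env_keys() {
  local envfile="$1" key
  shift
  for key in "$@"; do
    grep -Eq "^[[:space:]]*${key}=.+" "$envfile" 2>/dev/null || echo "$key"
  done
}
```

For example, `missing_env_keys .env TELEGRAM_API_ID TELEGRAM_API_HASH` prints nothing when both are set.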

---

## Troubleshooting

- Empty replies file
  - Ensure `-c` matches the channel in your posts CSV URLs.
  - Use `--append` so the file isn’t truncated before writing.
- “database is locked”
  - Use unique `--session-name` per parallel process; store sessions outside iCloud Drive.
- Forwards empty
  - Same-channel forwards are rare. This tool only finds self-forwards (not cross-channel).
- Analyze errors
  - Ensure CSVs have expected columns. Posts: `id,date,message,...`; Replies: `parent_id,id,date,message,...`.
- Exit code 1 when starting
  - Check the last log lines. Common causes: missing TELEGRAM_API_ID/HASH in `.env`, wrong channel handle vs CSV URLs, session file locked by another process (use distinct `--session-name`), or a bad output path.
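
For the "expected columns" check, the header row can be inspected directly. This is a sketch that assumes a plain comma-separated header without quoted column names or a byte-order mark; `missing_columns` is a hypothetical name.

```zsh
# Print each requested column that is absent from the CSV header row.
# Usage: missing_columns FILE COLUMN...
missing_columns() {
  local file="$1" col header
  shift
  header=$(head -n 1 "$file" | tr -d '\r')
  for col in "$@"; do
    case ",$header," in
      *",$col,"*) ;;            # column present
      *) echo "$col" ;;         # column missing
    esac
  done
}
```

For a replies file, `missing_columns data/replies.csv parent_id id date message` should print nothing.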

---

## Quick aliases for daily runs (zsh) ⚡

Paste this section into your current shell or your `~/.zshrc` to get convenient Make-like commands.

### Project defaults (edit as needed)

```zsh
# Channel and files
export CH="https://t.me/Premier_League_Update"
export POSTS_CSV="data/premier_league_update.csv"
export REPLIES_CSV="data/premier_league_replies.csv"
export FORWARDS_CSV="data/premier_league_forwards.csv"
export TAGS_CFG="config/tags.yaml"
export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"

# Sessions directory outside iCloud (avoid sqlite locks)
export SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"
```

### Aliases (zsh functions)

```zsh
# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
fast_replies() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  local sess="${5:-$SESSION_DIR/telegram_replies}"
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --min-replies 1 \
    --concurrency "$conc" \
    --resume \
    --append \
    --session-name "$sess"
}

# Chunked forwards: concurrent chunk scan with progress logs
chunked_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$FORWARDS_CSV}"
  local scan="${4:-20000}"
  local conc="${5:-10}"
  local chunk="${6:-1500}"
  local sess="${7:-$SESSION_DIR/telegram_forwards}"
  ./run_scraper.sh forwards \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --scan-limit "$scan" \
    --concurrency "$conc" \
    --chunk-size "$chunk" \
    --append \
    --session-name "$sess"
}

# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
analyze_combined() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv
}

# Emoji-aware analyze with sensible defaults (keep + boost)
analyze_emoji() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local mode="${6:-keep}"   # keep | demojize | strip
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --emoji-mode "$mode" \
    --emoji-boost
}

# One-shot daily pipeline: fast replies then combined analyze
run_daily() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}

# One-shot daily pipeline with forwards in parallel
run_daily_with_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local forwards="${4:-$FORWARDS_CSV}"
  local rep_conc="${5:-15}"
  local f_scan="${6:-20000}"
  local f_conc="${7:-10}"
  local f_chunk="${8:-1500}"
  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"

  # Launch replies and forwards in parallel with separate sessions
  local pid_r pid_f
  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!

  # Wait for completion and then analyze with emoji handling
  wait $pid_r
  wait $pid_f
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}
```

### Usage

```zsh
# Use project defaults
fast_replies
chunked_forwards
analyze_combined

# Override on the fly (channel, files, or tuning)
fast_replies "https://t.me/AnotherChannel" data/other_posts.csv data/other_replies.csv 12
chunked_forwards "$CH" "$POSTS_CSV" data/alt_forwards.csv 30000 12 2000
analyze_combined data/other_posts.csv data/other_replies.csv "$TAGS_CFG" "$FIXTURES_CSV" "$CH"
```
-												chore(repo): initialize git with .gitignore, .gitattributes, and project sources

											
										
										
											2025-10-26 08:56:41 -04:00
+								# Project command reference
 								This file lists all supported commands and practical permutations for `./run_scraper.sh`, with short comments and tips. It mirrors the actual CLI flags in the code.
 								- Shell: zsh (macOS) — commands below are ready to paste.
 								- Env: A `.venv` is created automatically; dependencies installed from `requirements.txt`.
 								- Secrets: Create `.env` with TELEGRAM_API_ID and TELEGRAM_API_HASH; for fixtures also set FOOTBALL_DATA_API_TOKEN.
 								- 2FA: If you use Telegram two-step verification, set TELEGRAM_2FA_PASSWORD in `.env` (the shell wrapper doesn’t accept a flag for this).
 								- Sessions: Telethon uses a SQLite session file (default `telegram.session`). When running multiple tools in parallel, use distinct `--session-name` values.
 								## Common conventions
 								- Channels
 								  - Use either handle or URL: `-c @name` or `-c https://t.me/name`.
 								  - For replies, the channel must match the posts’ source in your CSV `url` column.
 								- Output behavior
 								  - scrape/replies/forwards overwrite unless you pass `--append`.
 								  - analyze always overwrites its outputs.
 								- Rate-limits
 								  - Replies/forwards log `[rate-limit]` if Telegram asks you to wait. Reduce `--concurrency` if frequent.
 								- Parallel runs
 								  - Add `--session-name <unique>` per process to avoid “database is locked”. Prefer sessions outside iCloud Drive.
 								---
 								## Scrape (posts/messages)
 								Minimal (overwrite output):
 								```zsh
 								./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv
 								```
 								With date range and limit:
 								```zsh
 								./run_scraper.sh scrape \
 								  -c https://t.me/SomeChannel \
 								  -o data/messages.jsonl \
 								  --start-date 2025-01-01 \
 								  --end-date 2025-03-31 \
 								  --limit 500
 								```
 								Legacy offset date (deprecated; prefer --start-date):
 								```zsh
 								./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --offset-date 2025-01-01
 								```
 								Append to existing file and pass phone on first login:
 								```zsh
 								./run_scraper.sh scrape \
 								  -c @SomeChannel \
 								  -o data/messages.csv \
 								  --append \
 								  --phone +15551234567
 								```
 								Use a custom session (useful in parallel):
 								```zsh
 								./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --session-name telegram_scrape
 								```
 								Notes:
 								- Output format inferred by extension: `.csv` or `.jsonl`/`.ndjson`.
 								- Two-step verification: set TELEGRAM_2FA_PASSWORD in `.env` (no CLI flag in the shell wrapper).
 								### All valid forms (scrape)
 								Use one of the following combinations. Replace placeholders with your values.
 								- Base variables:
 								  - CH = @handle or https://t.me/handle
 								  - OUT = path to .csv or .jsonl
 								  - Optional value flags: [--limit N] [--session-name NAME] [--phone NUMBER]
 								- Date filter permutations (4) × Append flag (2) × Limit presence (2) = 16 forms
 ) No dates, no append, no limit
 								  ./run_scraper.sh scrape -c CH -o OUT
 ) No dates, no append, with limit
 								  ./run_scraper.sh scrape -c CH -o OUT --limit N
 ) No dates, with append, no limit
 								  ./run_scraper.sh scrape -c CH -o OUT --append
 ) No dates, with append, with limit
 								  ./run_scraper.sh scrape -c CH -o OUT --append --limit N
 ) Start only, no append, no limit
 								  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD
 ) Start only, no append, with limit
 								  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --limit N
 ) Start only, with append, no limit
 								  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append
 ) Start only, with append, with limit
 								  ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append --limit N
 ) End only, no append, no limit
 								  ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD
 ) End only, no append, with limit
 								   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --limit N
 ) End only, with append, no limit
 								   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append
 ) End only, with append, with limit
 								   ./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append --limit N
 ) Start and end, no append, no limit
 								   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD
 ) Start and end, no append, with limit
 								   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --limit N
 ) Start and end, with append, no limit
 								   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append
 ) Start and end, with append, with limit
 								   ./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append --limit N
 								Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]
 								- Deprecated alternative to start-date: add [--offset-date YYYY-MM-DD]
 								---
 								## Replies (fetch replies to posts)
 								From a posts CSV (fast path; skip posts with 0 replies in CSV):
 								```zsh
 								./run_scraper.sh replies \
 								  -c https://t.me/SourceChannel \
 								  --from-csv data/messages.csv \
 								  -o data/replies.csv \
 								  --min-replies 1 \
 								  --concurrency 15 \
 								  --resume \
 								  --append
 								```
 								Using explicit message IDs:
 								```zsh
 								./run_scraper.sh replies \
 								  -c @SourceChannel \
 								  --ids "123,456,789" \
 								  -o data/replies.csv \
 								  --concurrency 5 \
 								  --append
 								```
 								IDs from a file (one per line) using zsh substitution:
 								```zsh
 								IDS=$(tr '\n' ',' < parent_ids.txt | sed 's/,$//')
 								./run_scraper.sh replies -c @SourceChannel --ids "$IDS" -o data/replies.csv --concurrency 8 --append
 								```
 								Parallel-safe session name:
 								```zsh
 								./run_scraper.sh replies -c @SourceChannel --from-csv data/messages.csv -o data/replies.csv --concurrency 12 --resume --append --session-name telegram_replies
 								```
 								What the flags do:
 								- `--from-csv PATH` reads parent IDs from a CSV with an `id` column (optionally filtered by `--min-replies`).
 								- `--ids` provides a comma-separated list of parent IDs.
 								- `--concurrency K` processes K parent IDs in parallel (default 5).
 								- `--resume` dedupes by `(parent_id,id)` pairs already present in the output.
 								- `--append` appends to output instead of overwriting.
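
The `--resume` behaviour described above can be pictured with a short Python sketch. It is an illustration under assumptions, not the tool's actual code: the helper names `seen_pairs` and `filter_new` are hypothetical, and rows are modelled as dicts carrying the `parent_id` and `id` columns that the replies CSV is documented to have.

```python
import csv
import io


def seen_pairs(csv_text):
    """Collect (parent_id, id) pairs already present in an output CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["parent_id"], row["id"]) for row in reader}


def filter_new(rows, seen):
    """Keep only rows whose (parent_id, id) pair has not been written yet."""
    fresh = []
    for row in rows:
        key = (str(row["parent_id"]), str(row["id"]))
        if key not in seen:
            seen.add(key)  # also guards against duplicates within this batch
            fresh.append(row)
    return fresh


# Hypothetical data: two replies already on disk, two just fetched.
existing = "parent_id,id,date,message\n10,501,2025-01-01,hi\n10,502,2025-01-01,yo\n"
fetched = [
    {"parent_id": 10, "id": 502, "message": "yo"},   # already on disk
    {"parent_id": 10, "id": 503, "message": "new"},  # not yet written
]
new_rows = filter_new(fetched, seen_pairs(existing))
print([row["id"] for row in new_rows])
```

Only reply `503` survives the filter, which is why pairing `--resume` with `--append` lets an interrupted run continue without rewriting rows.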
 								Notes:
 								- The channel (`-c`) must match the posts’ source in your CSV URLs (the tool warns on mismatch).
 								- First login may require `--phone` (interactive prompt). For 2FA, set TELEGRAM_2FA_PASSWORD in `.env`.
 								### All valid forms (replies)
 								- Base variables:
 								  - CH = @handle or https://t.me/handle
 								  - OUT = path to .csv
 								  - Source: exactly one of S1 or S2
 								    - S1: --ids "id1,id2,..."
 								    - S2: --from-csv PATH [--min-replies N]
 								  - Optional: [--concurrency K] [--session-name NAME] [--phone NUMBER]
 								  - Binary: [--append], [--resume]
 								- Enumerated binary permutations for each source (4 per source = 8 total):
 								S1 + no append + no resume
 								  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT
 								S1 + no append + resume
 								  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --resume
 								S1 + append + no resume
 								  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append
 								S1 + append + resume
 								  ./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append --resume
 								S2 + no append + no resume
 								  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT
 								S2 + no append + resume
 								  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --resume
 								S2 + append + no resume
 								  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append
 								S2 + append + resume
 								  ./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append --resume
 								Optional add-ons valid for any form above:
 								- Add [--concurrency K] to tune speed; recommended 8–20
 								- With S2 you may add [--min-replies N] to prioritize parents with replies
 								- Add [--session-name NAME] and/or [--phone NUMBER]
 								---
 								## Forwards (same-channel forwards referencing posts)
 								Typical concurrent scan (best-effort; often zero results):
 								```zsh
 								./run_scraper.sh forwards \
 								  -c https://t.me/SourceChannel \
 								  --from-csv data/messages.csv \
 								  -o data/forwards.csv \
 								  --scan-limit 20000 \
 								  --concurrency 10 \
 								  --chunk-size 1500
 								```
 								With date filters (applied to scanned messages):
 								```zsh
 								./run_scraper.sh forwards \
 								  -c @SourceChannel \
 								  --from-csv data/messages.csv \
 								  -o data/forwards.csv \
 								  --start-date 2025-01-01 \
 								  --end-date 2025-03-31 \
 								  --scan-limit 10000 \
 								  --concurrency 8 \
 								  --chunk-size 1000
 								```
 								Using explicit message IDs:
 								```zsh
 								./run_scraper.sh forwards -c @SourceChannel --ids "100,200,300" -o data/forwards.csv --scan-limit 8000 --concurrency 6 --chunk-size 1000
 								```
 								Sequential mode (no chunking) by omitting --scan-limit:
 								```zsh
 								./run_scraper.sh forwards -c @SourceChannel --from-csv data/messages.csv -o data/forwards.csv
 								```
 								What the flags do:
 								- `--scan-limit N`: enables chunked, concurrent scanning of ~N recent message IDs.
 								- `--concurrency K`: number of id-chunks to scan in parallel (requires `--scan-limit`).
 								- `--chunk-size M`: approx. IDs per chunk (trade-off between balance/overhead). Start with 1000–2000.
 								- `--append`: append instead of overwrite.
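
To make the interplay of `--scan-limit` and `--chunk-size` concrete, here is a minimal Python sketch of how a window of recent message IDs could be split into chunks. The function name `make_chunks`, the inclusive `(low, high)` ranges, and the newest-first ordering are assumptions for illustration; the tool's real chunk boundaries may differ.

```python
def make_chunks(latest_id, scan_limit, chunk_size):
    """Split the newest `scan_limit` message IDs into inclusive (low, high) ranges."""
    lowest = max(1, latest_id - scan_limit + 1)
    chunks = []
    high = latest_id
    while high >= lowest:
        low = max(lowest, high - chunk_size + 1)
        chunks.append((low, high))
        high = low - 1  # next chunk ends just below this one
    return chunks


# Hypothetical channel whose newest message ID is 50000.
chunks = make_chunks(latest_id=50_000, scan_limit=20_000, chunk_size=1_500)
print(len(chunks), chunks[0], chunks[-1])
```

With these numbers the window splits into 14 ranges, the last one shorter than the rest. `--concurrency K` then caps how many such ranges are scanned at once, so a smaller `--chunk-size` gives better balance at the cost of more requests.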
 								Notes:
 								- This only finds forwards within the same channel that reference your parent IDs (self-forwards). Many channels will yield zero.
 								- Global cross-channel forward discovery is not supported here (can be added as a separate mode).
 								- Without `--scan-limit`, the tool scans sequentially from newest backwards and logs progress every ~1000 messages.
 								### All valid forms (forwards)
 								- Base variables:
 								  - CH = @handle or https://t.me/handle
 								  - OUT = path to .csv
 								  - Source: exactly one of S1 or S2
 								    - S1: --ids "id1,id2,..."
 								    - S2: --from-csv PATH
 								  - Modes:
 								    - M1: Sequential scan (omit --scan-limit)
 								    - M2: Chunked concurrent scan (requires --scan-limit N; accepts --concurrency K and --chunk-size M)
 								  - Optional date filters for both modes: [--start-date D] [--end-date D]
 								  - Binary: [--append]
 								  - Optional: [--session-name NAME] [--phone NUMBER]
 								- Enumerated permutations by mode, source, and append (2 modes × 2 sources × 2 append = 8 forms):
 								M1 + S1 + no append
 								  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT [--start-date D] [--end-date D]
 								M1 + S1 + append
 								  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --append [--start-date D] [--end-date D]
 								M1 + S2 + no append
 								  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT [--start-date D] [--end-date D]
 								M1 + S2 + append
 								  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --append [--start-date D] [--end-date D]
 								M2 + S1 + no append
 								  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
 								M2 + S1 + append
 								  ./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
 								M2 + S2 + no append
 								  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
 								M2 + S2 + append
 								  ./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
 								Optional add-ons valid for any form above:
 								- Add [--session-name NAME] and/or [--phone NUMBER]
 								---
 								## Analyze (reports and tagging)
 								Posts-only report + tagged CSV:
 								```zsh
 								./run_scraper.sh analyze \
 								  -i data/messages.csv \
 								  --channel @SourceChannel \
 								  --tags-config config/tags.yaml \
 								  --fixtures-csv data/fixtures.csv \
 								  --write-augmented-csv
 								```
 								Outputs:
 								- `data/messages_report.md`
 								- `data/messages_tagged.csv`
 								Replies-only report + tagged CSV:
 								```zsh
 								./run_scraper.sh analyze \
 								  -i data/replies.csv \
 								  --channel "Replies - @SourceChannel" \
 								  --tags-config config/tags.yaml \
 								  --write-augmented-csv
 								```
 								Outputs:
 								- `data/replies_report.md`
 								- `data/replies_tagged.csv`
 								Combined (posts report augmented with replies):
 								```zsh
 								./run_scraper.sh analyze \
 								  -i data/messages.csv \
 								  --channel @SourceChannel \
 								  --tags-config config/tags.yaml \
 								  --replies-csv data/replies.csv \
 								  --fixtures-csv data/fixtures.csv \
 								  --write-augmented-csv \
 								  --write-combined-csv \
 								  --emoji-mode keep \
 								  --emoji-boost \
 								  --save-plots
 								```
 								Adds to posts dataset:
 								- `sentiment_compound` for posts (VADER)
 								- `replies_sentiment_mean` (avg reply sentiment per post)
 								- `replies_count_scraped` and `replies_top_tags` (rollup from replies)
 								Report sections include:
 								- Summary, top posts by views/forwards/replies
 								- Temporal distributions
 								- Per-tag engagement
 								- Per-tag sentiment (posts)
 								- Replies per-tag summary
 								- Per-tag sentiment (replies)
- Combined sentiment (posts + replies)
- Matchday cross-analysis (when `--fixtures-csv` is provided):
  - Posts: on vs off matchdays (counts and sentiment shares)
  - Posts engagement vs matchday (replies per post: total, mean, median, share of posts with replies)
  - Replies: on vs off matchdays (counts and sentiment shares)
  - Replies by parent matchday and by reply date are both shown; parent-based classification is recommended for engagement.
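
The "posts engagement vs matchday" figures listed above (total, mean, median, share of posts with replies) can be reproduced by hand with a few lines of Python. This is a sketch under assumptions: `engagement_summary` is a hypothetical helper, and its input is simply a list of per-post reply counts for one group of posts.

```python
from statistics import mean, median


def engagement_summary(reply_counts):
    """Summarise replies-per-post for one group of posts (e.g. matchday posts)."""
    if not reply_counts:
        return {"posts": 0, "total": 0, "mean": 0.0, "median": 0.0, "share_with_replies": 0.0}
    with_replies = sum(1 for n in reply_counts if n > 0)
    return {
        "posts": len(reply_counts),
        "total": sum(reply_counts),
        "mean": mean(reply_counts),
        "median": median(reply_counts),
        "share_with_replies": with_replies / len(reply_counts),
    }


# Made-up reply counts, purely for illustration.
on_matchday = engagement_summary([0, 4, 10, 2])
off_matchday = engagement_summary([0, 0, 1, 3])
print(on_matchday)
print(off_matchday)
```

Comparing the two dictionaries side by side mirrors the on vs off matchday tables in the report.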
 								Notes:
 								- Analyze overwrites outputs; use `-o` to customize report filename if needed.
- Emoji handling: add `--emoji-mode keep|demojize|strip` (default `keep`). Optionally add `--emoji-boost` to gently tilt scores when clearly positive or negative emojis are present.
- Add `--write-combined-csv` to emit a unified CSV of posts+replies with a `content_type` column.
 								### All valid forms (analyze)
 								- Base variables:
 								  - IN = input CSV (posts or replies)
 								  - Optional outputs/labels: [-o REPORT.md] [--channel @handle]
 								  - Optional configs/data: [--tags-config config/tags.yaml] [--replies-csv REPLIES.csv] [--fixtures-csv FIXTURES.csv]
 								  - Binary: [--write-augmented-csv]
 								- Core permutations across replies-csv, fixtures-csv, write-augmented-csv (2×2×2 = 8 forms):
1) No replies, no fixtures, no aug
  ./run_scraper.sh analyze -i IN
2) No replies, no fixtures, with aug
  ./run_scraper.sh analyze -i IN --write-augmented-csv
3) No replies, with fixtures, no aug
  ./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv
4) No replies, with fixtures, with aug
  ./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv --write-augmented-csv
5) With replies, no fixtures, no aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv
6) With replies, no fixtures, with aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --write-augmented-csv
7) With replies, with fixtures, no aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv
8) With replies, with fixtures, with aug
  ./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv --write-augmented-csv
 								Optional add-ons valid for any form above:
- Add [-o REPORT.md] to control the output filename
- Add [--channel @handle] to set the report title
- Add [--tags-config config/tags.yaml] to enable tagging and per-tag summaries
- Add [--emoji-mode keep|demojize|strip] and optionally [--emoji-boost]
- Add [--write-combined-csv] to produce a merged posts+replies CSV
- Add [--save-plots] to emit plots to the data folder
- Add [--sentiment-backend transformers] and [--transformers-model <name-or-path>] to use a local Hugging Face model instead of VADER
- Add [--export-transformers-details] to include `sentiment_label` and `sentiment_probs` in augmented/combined CSVs
- Add [--sentiment-backend gpt] and optionally [--gpt-model MODEL] [--gpt-base-url URL] [--gpt-batch-size K] to use a local GPT (Ollama) backend
- Plot sizing and label controls (daily charts):
  - [--plot-width-scale FLOAT] [--plot-max-width INCHES] [--plot-height INCHES]
  - [--activity-top-n N]
  - [--labels-max-per-day N] [--labels-per-line N] [--labels-band-y FLOAT] [--labels-stagger-rows N] [--labels-annotate-mode ticks|all|ticks+top]
 								When fixtures are provided (`--fixtures-csv`):
 								- The report adds a "## Matchday cross-analysis" section with on vs off matchday tables.
 								- Plots include:
 								  - daily_activity_stacked.png with match labels inside the chart
 								  - daily_volume_and_sentiment.png (bars: volume; lines: pos%/neg%)
 								  - matchday_sentiment_overall.png (time series on fixture days)
 								  - matchday_posts_volume_vs_sentiment.png (scatter)
 								- The combined CSV (with `--write-combined-csv`) includes `is_matchday` and, for replies, `parent_is_matchday` when available.
 								- Replies are classified two ways: by reply date (`is_matchday` on the reply row) and by their parent post (`parent_is_matchday`). The latter better reflects matchday-driven engagement.
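
The two classifications can disagree, which is the reason both columns exist. The Python sketch below illustrates the idea; `classify_reply` is a hypothetical helper, and comparing by calendar date (with no timezone handling) is an assumption rather than a statement about how the analyzer does it.

```python
from datetime import date, datetime


def classify_reply(reply_dt, parent_dt, matchdays):
    """Flag a reply by its own date and by its parent post's date."""
    return {
        "is_matchday": reply_dt.date() in matchdays,
        "parent_is_matchday": parent_dt.date() in matchdays,
    }


# Hypothetical fixture days and timestamps.
matchdays = {date(2025, 8, 16), date(2025, 8, 23)}
flags = classify_reply(
    reply_dt=datetime(2025, 8, 17, 0, 40),   # reply posted just after midnight
    parent_dt=datetime(2025, 8, 16, 21, 5),  # parent posted on a fixture day
    matchdays=matchdays,
)
print(flags)
```

Here the reply lands on a non-fixture date while its parent was posted on a matchday, so only `parent_is_matchday` is true. Late replies like this are what the parent-based view captures.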
 								Emoji and plots examples:
 								```zsh
 								# Keep emojis (default) and boost for strong positive/negative emojis
 								./run_scraper.sh analyze -i data/messages.csv --emoji-mode keep --emoji-boost --save-plots
 								# Demojize to :smiling_face: tokens (helps some tokenizers), with boost
 								./run_scraper.sh analyze -i data/messages.csv --emoji-mode demojize --emoji-boost
 								# Strip emojis entirely (if they add noise)
 								./run_scraper.sh analyze -i data/messages.csv --emoji-mode strip --save-plots
 								# Use a transformers model for sentiment (will auto-download on first use unless a local path is provided).
 								# Tip: for an off-the-shelf sentiment head, try a fine-tuned model like SST-2:
 								./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
 								  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english
```
## Local GPT backend (Ollama)
Use a local GPT model that returns JSON {label, confidence} per message; the analyzer maps this to a compound score and falls back to VADER on errors.
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend gpt \
  --gpt-model llama3 \
  --gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```
 								---
 								## Train a local transformers sentiment model
 								Prepare a labeled CSV with at least two columns: `message` and `label` (e.g., neg/neu/pos or 0/1/2).
 								Don’t have one yet? Create a labeling set from your existing posts/replies:
 								```zsh
 								# Generate a CSV to annotate by hand (adds a blank 'label' column)
 								./.venv/bin/python -m src.make_labeling_set \
 								  --posts-csv data/premier_league_update.csv \
 								  --replies-csv data/premier_league_replies.csv \
 								  --sample-size 1000 \
 								  -o data/labeled_sentiment.csv
 								# Or via alias (after sourcing scripts/aliases.zsh)
 								make_label_set "$POSTS_CSV" "$REPLIES_CSV" data/labeled_sentiment.csv 1000
 								```
 								Then fine-tune:
 								```zsh
 								# Ensure the venv exists (run any ./run_scraper.sh command once), then:
 								./.venv/bin/python -m src.train_sentiment \
 								  --train-csv data/labeled_sentiment.csv \
 								  --text-col message \
 								  --label-col label \
 								  --model-name distilbert-base-uncased \
 								  --output-dir models/sentiment-distilbert \
 								  --epochs 3 --batch-size 16
 								```
 								Use it in analyze:
 								```zsh
 								./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
 								  --sentiment-backend transformers \
 								  --transformers-model models/sentiment-distilbert
 								```
 								Export details (labels, probabilities) into CSVs:
 								```zsh
 								./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
 								  --sentiment-backend transformers \
 								  --transformers-model models/sentiment-distilbert \
 								  --export-transformers-details \
 								  --write-augmented-csv --write-combined-csv
 								```
 								Notes:
 								- The analyzer maps model class probabilities to a VADER-like compound score in [-1, 1] for compatibility with the rest of the report.
 								- If the model has id2label including 'neg','neu','pos' labels, the mapping is more accurate; otherwise it defaults to pos - neg.
 								- GPU/Apple Silicon (MPS) will be used automatically if available.
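
As a rough picture of the probability-to-compound mapping in the notes above, the Python sketch below computes `P(pos) - P(neg)`. The function name `compound_from_probs` is hypothetical, and so is the fallback used when no labels are available (first class treated as negative, last as positive); the analyzer's own rules may be more refined.

```python
def compound_from_probs(probs, id2label=None):
    """Map class probabilities to a score in [-1, 1] as P(pos) - P(neg)."""
    # Assumed fallback when labels are missing: first class = neg, last class = pos.
    neg_idx, pos_idx = 0, len(probs) - 1
    if id2label:
        for idx, label in id2label.items():
            name = str(label).lower()
            if name.startswith("neg"):
                neg_idx = int(idx)
            elif name.startswith("pos"):
                pos_idx = int(idx)
    score = probs[pos_idx] - probs[neg_idx]
    return max(-1.0, min(1.0, score))


three_class = compound_from_probs([0.25, 0.25, 0.5], {0: "neg", 1: "neu", 2: "pos"})
two_class = compound_from_probs([0.75, 0.25], {0: "NEGATIVE", 1: "POSITIVE"})
print(three_class, two_class)
```

A neutral class contributes nothing directly; it only shrinks the magnitude by taking probability mass away from the other two classes.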
 								Torch install note (macOS):
 								- `requirements.txt` uses conditional pins: `torch==2.3.1` for Python < 3.13 and `torch>=2.7.1` for Python ≥ 3.13. This keeps installs smooth on macOS. If you hit install issues, let us know.
 								## Evaluate a fine-tuned model
 								```zsh
 								./.venv/bin/python -m src.eval_sentiment \
 								  --csv data/labeled_holdout.csv \
 								  --text-col message \
 								  --label-col label \
 								  --model models/sentiment-distilbert
 								```
 								Prints accuracy, macro-precision/recall/F1, and a classification report.
 								## Fixtures (Premier League schedule via football-data.org)
 								Fetch fixtures between dates:
 								```zsh
 								./run_scraper.sh fixtures \
 								  --start-date 2025-08-15 \
 								  --end-date 2025-10-15 \
 								  -o data/fixtures.csv
 								```
 								Notes:
 								- Requires `FOOTBALL_DATA_API_TOKEN` in `.env`.
 								- Output may be `.csv` or `.json` (by extension).
 								### All valid forms (fixtures)
 								- Base variables:
 								  - SD = start date YYYY-MM-DD
 								  - ED = end date YYYY-MM-DD
 								  - OUT = output .csv or .json
 								Form:
 								  ./run_scraper.sh fixtures --start-date SD --end-date ED -o OUT
 								---
 								## Advanced recipes
 								Parallel replies + forwards with separate sessions:
 								```zsh
 								# Terminal 1 – replies
 								./run_scraper.sh replies \
 								  -c https://t.me/SourceChannel \
 								  --from-csv data/messages.csv \
 								  -o data/replies.csv \
 								  --min-replies 1 \
 								  --concurrency 15 \
 								  --resume \
 								  --append \
 								  --session-name "$HOME/.local/share/telethon_sessions/telegram_replies"
 								# Terminal 2 – forwards
 								./run_scraper.sh forwards \
 								  -c https://t.me/SourceChannel \
 								  --from-csv data/messages.csv \
 								  -o data/forwards.csv \
 								  --scan-limit 20000 \
 								  --concurrency 10 \
 								  --chunk-size 1500 \
 								  --session-name "$HOME/.local/share/telethon_sessions/telegram_forwards"
 								```
 								Tuning for rate limits:
- If `[rate-limit]` logs are frequent, lower `--concurrency` by 3 to 5 at a time and keep `--chunk-size` around 1000–2000.
 								- For replies, prioritize with `--min-replies 1` to avoid parents with zero replies.
 								Safety:
- For replies, pass `--append` together with `--resume`: `--append` keeps the existing file from being truncated, and `--resume` skips rows already written.
 								- Forwards and scrape don’t dedupe; prefer writing to a new file or dedupe after.
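
For the "dedupe after" route, a small post-processing step is usually enough. The Python sketch below keeps the first row per key; `dedupe_csv` is a hypothetical helper, and using `id` as the key is an assumption based on the posts columns listed in this document (check the header of a forwards CSV before reusing it there).

```python
import csv
import io


def dedupe_csv(text, key_fields=("id",)):
    """Return CSV text with only the first row kept for each key."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()
    for row in reader:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # duplicate left behind by an earlier --append run
        seen.add(key)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical file contents where post 1 was appended twice.
raw = "id,date,message\n1,2025-01-01,a\n2,2025-01-02,b\n1,2025-01-01,a\n"
print(dedupe_csv(raw))
```

When working with real files, read and write them with `newline=""` so that quoted multi-line messages survive the round trip.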
 								---
 								## Environment setup quick-start
 								Create `.env` (script will prompt if missing):
 								```
 								TELEGRAM_API_ID=123456
 								TELEGRAM_API_HASH=your_api_hash
 								# Optional defaults
 								TELEGRAM_SESSION_NAME=telegram
 								TELEGRAM_2FA_PASSWORD=your_2fa_password
 								FOOTBALL_DATA_API_TOKEN=your_token
 								```
 								First run will prompt for phone and code (and 2FA if enabled).
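
If you want to check a `.env` before the first run, a few lines of Python can list which required keys are missing. This is a sketch, not part of the project: `missing_env_keys` is a hypothetical helper, and the parsing is deliberately simple (plain `KEY=value` lines, `#` comments).

```python
REQUIRED_KEYS = ("TELEGRAM_API_ID", "TELEGRAM_API_HASH")


def missing_env_keys(env_text, required=REQUIRED_KEYS):
    """Return required keys that are absent or empty in .env-style text."""
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not values.get(key)]


sample = "TELEGRAM_API_ID=123456\n# Optional defaults\nTELEGRAM_SESSION_NAME=telegram\n"
print(missing_env_keys(sample))
```

For the `fixtures` command, pass `required=REQUIRED_KEYS + ("FOOTBALL_DATA_API_TOKEN",)` to include the token in the check.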
 								---
 								## Troubleshooting
 								- Empty replies file
 								  - Ensure `-c` matches the channel in your posts CSV URLs.
 								  - Use `--append` so the file isn’t truncated before writing.
 								- “database is locked”
 								  - Use unique `--session-name` per parallel process; store sessions outside iCloud Drive.
 								- Forwards empty
 								  - Same-channel forwards are rare. This tool only finds self-forwards (not cross-channel).
 								- Analyze errors
 								  - Ensure CSVs have expected columns. Posts: `id,date,message,...`; Replies: `parent_id,id,date,message,...`.
 								- Exit code 1 when starting
 								  - Check the last log lines. Common causes: missing TELEGRAM_API_ID/HASH in `.env`, wrong channel handle vs CSV URLs, session file locked by another process (use distinct `--session-name`), or a bad output path.
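
For the "Analyze errors" case, checking the CSV header up front can save a failed run. The Python sketch below reports which of the expected columns listed above are missing; `missing_columns` and the `EXPECTED` table are hypothetical names, and only the columns this document names are checked.

```python
import csv
import io

# Minimum columns named in this document; real files carry more.
EXPECTED = {
    "posts": ("id", "date", "message"),
    "replies": ("parent_id", "id", "date", "message"),
}


def missing_columns(csv_text, kind):
    """Return expected columns that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [col for col in EXPECTED[kind] if col not in header]


posts_csv = "id,date,message,views\n1,2025-01-01,hi,10\n"
print(missing_columns(posts_csv, "posts"))
print(missing_columns(posts_csv, "replies"))
```

An empty list means the header has everything the check looks for. Passing a posts file where replies are expected shows up as a missing `parent_id`.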
 								---
 								## Quick aliases for daily runs (zsh) ⚡
 								Paste this section into your current shell or your `~/.zshrc` to get convenient Make-like commands.
 								### Project defaults (edit as needed)
 								```zsh
 								# Channel and files
 								export CH="https://t.me/Premier_League_Update"
 								export POSTS_CSV="data/premier_league_update.csv"
 								export REPLIES_CSV="data/premier_league_replies.csv"
 								export FORWARDS_CSV="data/premier_league_forwards.csv"
 								export TAGS_CFG="config/tags.yaml"
 								export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"
 								# Sessions directory outside iCloud (avoid sqlite locks)
 								export SESSION_DIR="$HOME/.local/share/telethon_sessions"
 								mkdir -p "$SESSION_DIR"
 								```
 								### Aliases (zsh functions)
 								```zsh
 								# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
 								fast_replies() {
 								  local ch="${1:-$CH}"
 								  local posts="${2:-$POSTS_CSV}"
 								  local out="${3:-$REPLIES_CSV}"
 								  local conc="${4:-15}"
 								  local sess="${5:-$SESSION_DIR/telegram_replies}"
 								  ./run_scraper.sh replies \
 								    -c "$ch" \
 								    --from-csv "$posts" \
 								    -o "$out" \
 								    --min-replies 1 \
 								    --concurrency "$conc" \
 								    --resume \
 								    --append \
 								    --session-name "$sess"
 								}
 								# Chunked forwards: concurrent chunk scan with progress logs
 								chunked_forwards() {
 								  local ch="${1:-$CH}"
 								  local posts="${2:-$POSTS_CSV}"
 								  local out="${3:-$FORWARDS_CSV}"
 								  local scan="${4:-20000}"
 								  local conc="${5:-10}"
 								  local chunk="${6:-1500}"
 								  local sess="${7:-$SESSION_DIR/telegram_forwards}"
 								  ./run_scraper.sh forwards \
 								    -c "$ch" \
 								    --from-csv "$posts" \
 								    -o "$out" \
 								    --scan-limit "$scan" \
 								    --concurrency "$conc" \
 								    --chunk-size "$chunk" \
 								    --append \
 								    --session-name "$sess"
 								}
 								# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
 								analyze_combined() {
 								  local posts="${1:-$POSTS_CSV}"
 								  local replies="${2:-$REPLIES_CSV}"
 								  local tags="${3:-$TAGS_CFG}"
 								  local fixtures="${4:-$FIXTURES_CSV}"
 								  local ch="${5:-$CH}"
 								  ./run_scraper.sh analyze \
 								    -i "$posts" \
 								    --channel "$ch" \
 								    --tags-config "$tags" \
 								    --replies-csv "$replies" \
 								    --fixtures-csv "$fixtures" \
 								    --write-augmented-csv \
 								    --write-combined-csv
 								}
 								# Emoji-aware analyze with sensible defaults (keep + boost)
 								analyze_emoji() {
 								  local posts="${1:-$POSTS_CSV}"
 								  local replies="${2:-$REPLIES_CSV}"
 								  local tags="${3:-$TAGS_CFG}"
 								  local fixtures="${4:-$FIXTURES_CSV}"
 								  local ch="${5:-$CH}"
 								  local mode="${6:-keep}"   # keep | demojize | strip
 								  ./run_scraper.sh analyze \
 								    -i "$posts" \
 								    --channel "$ch" \
 								    --tags-config "$tags" \
 								    --replies-csv "$replies" \
 								    --fixtures-csv "$fixtures" \
 								    --write-augmented-csv \
 								    --write-combined-csv \
 								    --emoji-mode "$mode" \
 								    --emoji-boost
 								}
 								# One-shot daily pipeline: fast replies then combined analyze
 								run_daily() {
 								  local ch="${1:-$CH}"
 								  local posts="${2:-$POSTS_CSV}"
 								  local replies="${3:-$REPLIES_CSV}"
 								  local conc="${4:-15}"
 								  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
 								  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
 								}
 								# One-shot daily pipeline with forwards in parallel
 								run_daily_with_forwards() {
 								  local ch="${1:-$CH}"
 								  local posts="${2:-$POSTS_CSV}"
 								  local replies="${3:-$REPLIES_CSV}"
 								  local forwards="${4:-$FORWARDS_CSV}"
 								  local rep_conc="${5:-15}"
 								  local f_scan="${6:-20000}"
 								  local f_conc="${7:-10}"
 								  local f_chunk="${8:-1500}"
 								  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
 								  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"
 								  # Launch replies and forwards in parallel with separate sessions
 								  local pid_r pid_f
 								  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
 								  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!
 								  # Wait for completion and then analyze with emoji handling
 								  wait $pid_r
 								  wait $pid_f
 								  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
 								}
 								```
 								### Usage
 								```zsh
 								# Use project defaults
 								fast_replies
 								chunked_forwards
 								analyze_combined
 								# Override on the fly (channel, files, or tuning)
 								fast_replies "https://t.me/AnotherChannel" data/other_posts.csv data/other_replies.csv 12
 								chunked_forwards "$CH" "$POSTS_CSV" data/alt_forwards.csv 30000 12 2000
 								analyze_combined data/other_posts.csv data/other_replies.csv "$TAGS_CFG" "$FIXTURES_CSV" "$CH"
 								```