epl_python/docs/COMMANDS.md
# Project command reference
This file lists all supported commands and practical permutations for `./run_scraper.sh`, with short comments and tips. It mirrors the actual CLI flags in the code.
- Shell: zsh (macOS) — commands below are ready to paste.
- Env: A `.venv` is created automatically; dependencies installed from `requirements.txt`.
- Secrets: Create `.env` with TELEGRAM_API_ID and TELEGRAM_API_HASH; for fixtures also set FOOTBALL_DATA_API_TOKEN.
- 2FA: If you use Telegram two-step verification, set TELEGRAM_2FA_PASSWORD in `.env` (the shell wrapper doesn't accept a flag for this).
- Sessions: Telethon uses a SQLite session file (default `telegram.session`). When running multiple tools in parallel, use distinct `--session-name` values.
## Common conventions
- Channels
  - Use either handle or URL: `-c @name` or `-c https://t.me/name`.
  - For replies, the channel must match the posts' source in your CSV `url` column.
- Output behavior
  - scrape/replies/forwards overwrite unless you pass `--append`.
  - analyze always overwrites its outputs.
- Rate-limits
  - Replies/forwards log `[rate-limit]` if Telegram asks you to wait. Reduce `--concurrency` if this happens frequently.
- Parallel runs
  - Add `--session-name <unique>` per process to avoid “database is locked”. Prefer sessions outside iCloud Drive.
---
## Scrape (posts/messages)
Minimal (overwrite output):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv
```
With date range and limit:
```zsh
./run_scraper.sh scrape \
-c https://t.me/SomeChannel \
-o data/messages.jsonl \
--start-date 2025-01-01 \
--end-date 2025-03-31 \
--limit 500
```
Legacy offset date (deprecated; prefer --start-date):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --offset-date 2025-01-01
```
Append to existing file and pass phone on first login:
```zsh
./run_scraper.sh scrape \
-c @SomeChannel \
-o data/messages.csv \
--append \
--phone +15551234567
```
Use a custom session (useful in parallel):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --session-name telegram_scrape
```
Notes:
- Output format inferred by extension: `.csv` or `.jsonl`/`.ndjson`.
- Two-step verification: set TELEGRAM_2FA_PASSWORD in `.env` (no CLI flag in the shell wrapper).
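The extension-based format inference can be pictured as a small dispatch. This is an illustrative sketch, not the project's actual code; `infer_format` is a hypothetical name:

```python
from pathlib import Path

def infer_format(out_path: str) -> str:
    """Pick a writer from the output extension, per the note above:
    .csv -> CSV, .jsonl/.ndjson -> JSON Lines."""
    ext = Path(out_path).suffix.lower()
    if ext == ".csv":
        return "csv"
    if ext in {".jsonl", ".ndjson"}:
        return "jsonl"
    raise ValueError(f"unsupported output extension: {ext!r}")

print(infer_format("data/messages.csv"))     # csv
print(infer_format("data/messages.ndjson"))  # jsonl
```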
### All valid forms (scrape)
Use one of the following combinations. Replace placeholders with your values.
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv or .jsonl
- Optional value flags: [--limit N] [--session-name NAME] [--phone NUMBER]
- Date filter permutations (4) × Append flag (2) × Limit presence (2) = 16 forms
1) No dates, no append, no limit
./run_scraper.sh scrape -c CH -o OUT
2) No dates, no append, with limit
./run_scraper.sh scrape -c CH -o OUT --limit N
3) No dates, with append, no limit
./run_scraper.sh scrape -c CH -o OUT --append
4) No dates, with append, with limit
./run_scraper.sh scrape -c CH -o OUT --append --limit N
5) Start only, no append, no limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD
6) Start only, no append, with limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --limit N
7) Start only, with append, no limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append
8) Start only, with append, with limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append --limit N
9) End only, no append, no limit
./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD
10) End only, no append, with limit
./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --limit N
11) End only, with append, no limit
./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append
12) End only, with append, with limit
./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append --limit N
13) Start and end, no append, no limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD
14) Start and end, no append, with limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --limit N
15) Start and end, with append, no limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append
16) Start and end, with append, with limit
./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append --limit N
Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]
- Deprecated alternative to start-date: add [--offset-date YYYY-MM-DD]
---
## Replies (fetch replies to posts)
From a posts CSV (fast path; skips posts whose CSV reply count is 0):
```zsh
./run_scraper.sh replies \
-c https://t.me/SourceChannel \
--from-csv data/messages.csv \
-o data/replies.csv \
--min-replies 1 \
--concurrency 15 \
--resume \
--append
```
Using explicit message IDs:
```zsh
./run_scraper.sh replies \
-c @SourceChannel \
--ids "123,456,789" \
-o data/replies.csv \
--concurrency 5 \
--append
```
IDs from a file (one per line) using zsh substitution:
```zsh
IDS=$(tr '\n' ',' < parent_ids.txt | sed 's/,$//')
./run_scraper.sh replies -c @SourceChannel --ids "$IDS" -o data/replies.csv --concurrency 8 --append
```
Parallel-safe session name:
```zsh
./run_scraper.sh replies -c @SourceChannel --from-csv data/messages.csv -o data/replies.csv --concurrency 12 --resume --append --session-name telegram_replies
```
What the flags do:
- `--from-csv PATH` reads parent IDs from a CSV with an `id` column (optionally filtered by `--min-replies`).
- `--ids` provides a comma-separated list of parent IDs.
- `--concurrency K` processes K parent IDs in parallel (default 5).
- `--resume` dedupes by `(parent_id,id)` pairs already present in the output.
- `--append` appends to output instead of overwriting.
Notes:
- The channel (`-c`) must match the posts' source in your CSV URLs (the tool warns on a mismatch).
- First login may require `--phone` (interactive prompt). For 2FA, set TELEGRAM_2FA_PASSWORD in `.env`.
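The `--resume` dedupe described above amounts to a set lookup on `(parent_id, id)` pairs. A minimal Python sketch of the idea, with made-up data (not the tool's actual code):

```python
import csv
import io

# Pretend this is the existing output file that --resume would read.
existing_output = "parent_id,id,message\n123,9001,first\n123,9002,second\n"

# Collect (parent_id, id) pairs already written.
seen = {
    (row["parent_id"], row["id"])
    for row in csv.DictReader(io.StringIO(existing_output))
}

# Newly fetched replies: keep only pairs not written yet.
fetched = [("123", "9002"), ("456", "7000")]
to_write = [pair for pair in fetched if pair not in seen]
print(to_write)  # [('456', '7000')]
```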
### All valid forms (replies)
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
- Source: exactly one of S1 or S2
  - S1: --ids "id1,id2,..."
  - S2: --from-csv PATH [--min-replies N]
- Optional: [--concurrency K] [--session-name NAME] [--phone NUMBER]
- Binary: [--append], [--resume]
- Enumerated binary permutations for each source (4 per source = 8 total):
S1 + no append + no resume
./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT
S1 + no append + resume
./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --resume
S1 + append + no resume
./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append
S1 + append + resume
./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append --resume
S2 + no append + no resume
./run_scraper.sh replies -c CH --from-csv PATH -o OUT
S2 + no append + resume
./run_scraper.sh replies -c CH --from-csv PATH -o OUT --resume
S2 + append + no resume
./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append
S2 + append + resume
./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append --resume
Optional add-ons valid for any form above:
- Add [--concurrency K] to tune speed; recommended 8-20
- With S2 you may add [--min-replies N] to prioritize parents with replies
- Add [--session-name NAME] and/or [--phone NUMBER]
---
## Forwards (same-channel forwards referencing posts)
Typical concurrent scan (best-effort; often zero results):
```zsh
./run_scraper.sh forwards \
-c https://t.me/SourceChannel \
--from-csv data/messages.csv \
-o data/forwards.csv \
--scan-limit 20000 \
--concurrency 10 \
--chunk-size 1500
```
With date filters (applied to scanned messages):
```zsh
./run_scraper.sh forwards \
-c @SourceChannel \
--from-csv data/messages.csv \
-o data/forwards.csv \
--start-date 2025-01-01 \
--end-date 2025-03-31 \
--scan-limit 10000 \
--concurrency 8 \
--chunk-size 1000
```
Using explicit message IDs:
```zsh
./run_scraper.sh forwards -c @SourceChannel --ids "100,200,300" -o data/forwards.csv --scan-limit 8000 --concurrency 6 --chunk-size 1000
```
Sequential mode (no chunking) by omitting --scan-limit:
```zsh
./run_scraper.sh forwards -c @SourceChannel --from-csv data/messages.csv -o data/forwards.csv
```
What the flags do:
- `--scan-limit N`: enables chunked, concurrent scanning of ~N recent message IDs.
- `--concurrency K`: number of id-chunks to scan in parallel (requires `--scan-limit`).
- `--chunk-size M`: approximate IDs per chunk (a trade-off between load balance and overhead). Start with 1000-2000.
- `--append`: append instead of overwrite.
Notes:
- This only finds forwards within the same channel that reference your parent IDs (self-forwards). Many channels will yield zero.
- Global cross-channel forward discovery is not supported here (can be added as a separate mode).
- Without `--scan-limit`, the tool scans sequentially from newest backwards and logs progress every ~1000 messages.
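The chunked mode can be pictured as carving the most recent `--scan-limit` IDs into `--chunk-size` ranges that are then scanned in parallel. An illustrative sketch under that assumption (the tool's actual splitting is internal):

```python
def id_chunks(newest_id: int, scan_limit: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split the ~scan_limit most recent message IDs into inclusive
    (low, high) ranges of at most chunk_size IDs, newest first."""
    oldest = max(1, newest_id - scan_limit + 1)
    chunks = []
    high = newest_id
    while high >= oldest:
        low = max(oldest, high - chunk_size + 1)
        chunks.append((low, high))
        high = low - 1
    return chunks

print(id_chunks(newest_id=5000, scan_limit=3000, chunk_size=1000))
# [(4001, 5000), (3001, 4000), (2001, 3000)]
```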
### All valid forms (forwards)
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
- Source: exactly one of S1 or S2
  - S1: --ids "id1,id2,..."
  - S2: --from-csv PATH
- Modes:
  - M1: Sequential scan (omit --scan-limit)
  - M2: Chunked concurrent scan (requires --scan-limit N; accepts --concurrency K and --chunk-size M)
- Optional date filters for both modes: [--start-date D] [--end-date D]
- Binary: [--append]
- Optional: [--session-name NAME] [--phone NUMBER]
- Enumerated permutations by mode, source, and append (2 modes × 2 sources × 2 append = 8 forms):
M1 + S1 + no append
./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT [--start-date D] [--end-date D]
M1 + S1 + append
./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --append [--start-date D] [--end-date D]
M1 + S2 + no append
./run_scraper.sh forwards -c CH --from-csv PATH -o OUT [--start-date D] [--end-date D]
M1 + S2 + append
./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --append [--start-date D] [--end-date D]
M2 + S1 + no append
./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S1 + append
./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + no append
./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
M2 + S2 + append
./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]
Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]
---
## Analyze (reports and tagging)
Posts-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
-i data/messages.csv \
--channel @SourceChannel \
--tags-config config/tags.yaml \
--fixtures-csv data/fixtures.csv \
--write-augmented-csv
```
Outputs:
- `data/messages_report.md`
- `data/messages_tagged.csv`
Replies-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
-i data/replies.csv \
--channel "Replies - @SourceChannel" \
--tags-config config/tags.yaml \
--write-augmented-csv
```
Outputs:
- `data/replies_report.md`
- `data/replies_tagged.csv`
Combined (posts report augmented with replies):
```zsh
./run_scraper.sh analyze \
-i data/messages.csv \
--channel @SourceChannel \
--tags-config config/tags.yaml \
--replies-csv data/replies.csv \
--fixtures-csv data/fixtures.csv \
--write-augmented-csv \
--write-combined-csv \
--emoji-mode keep \
--emoji-boost \
--save-plots
```
Adds to posts dataset:
- `sentiment_compound` for posts (VADER)
- `replies_sentiment_mean` (avg reply sentiment per post)
- `replies_count_scraped` and `replies_top_tags` (rollup from replies)
Report sections include:
- Summary, top posts by views/forwards/replies
- Temporal distributions
- Per-tag engagement
- Per-tag sentiment (posts)
- Replies per-tag summary
- Per-tag sentiment (replies)
- Combined sentiment (posts + replies)
- Matchday cross-analysis (when `--fixtures-csv` is provided):
  - Posts: on vs off matchdays (counts and sentiment shares)
  - Posts engagement vs matchday (replies per post: total, mean, median, share of posts with replies)
  - Replies: on vs off matchdays (counts and sentiment shares)
  - Replies by parent matchday and by reply date are both shown; parent-based classification is recommended for engagement.
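The two ways of classifying a reply can be illustrated with a toy example (the fixture dates and IDs below are made up):

```python
from datetime import date

# Hypothetical fixture dates, as would be parsed from --fixtures-csv.
fixture_days = {date(2025, 1, 4), date(2025, 1, 18)}

# A parent post published on a matchday, and a reply arriving a day later.
parent = {"id": 100, "date": date(2025, 1, 4)}
reply = {"parent_id": 100, "date": date(2025, 1, 5)}

# Classify the reply by its own date...
is_matchday = reply["date"] in fixture_days          # False
# ...or by the post it responds to (better reflects matchday engagement).
parent_is_matchday = parent["date"] in fixture_days  # True

print(is_matchday, parent_is_matchday)  # False True
```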
Notes:
- Analyze overwrites outputs; use `-o` to customize report filename if needed.
- Emoji handling: add `--emoji-mode keep|demojize|strip` (default keep). Optionally `--emoji-boost` to gently tilt scores when clearly positive/negative emojis are present.
- Add `--write-combined-csv` to emit a unified CSV of posts+replies with a `content_type` column.
### All valid forms (analyze)
- Base variables:
  - IN = input CSV (posts or replies)
- Optional outputs/labels: [-o REPORT.md] [--channel @handle]
- Optional configs/data: [--tags-config config/tags.yaml] [--replies-csv REPLIES.csv] [--fixtures-csv FIXTURES.csv]
- Binary: [--write-augmented-csv]
- Core permutations across replies-csv, fixtures-csv, write-augmented-csv (2×2×2 = 8 forms):
1) No replies, no fixtures, no aug
./run_scraper.sh analyze -i IN
2) No replies, no fixtures, with aug
./run_scraper.sh analyze -i IN --write-augmented-csv
3) No replies, with fixtures, no aug
./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv
4) No replies, with fixtures, with aug
./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv --write-augmented-csv
5) With replies, no fixtures, no aug
./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv
6) With replies, no fixtures, with aug
./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --write-augmented-csv
7) With replies, with fixtures, no aug
./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv
8) With replies, with fixtures, with aug
./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv --write-augmented-csv
Optional add-ons valid for any form above:
- Add [-o REPORT.md] to control the output filename
- Add [--channel @handle] for the report title
- Add [--tags-config config/tags.yaml] to enable tagging and per-tag summaries
- Add [--emoji-mode keep|demojize|strip] and optionally [--emoji-boost]
- Add [--write-combined-csv] to produce a merged posts+replies CSV
- Add [--save-plots] to emit plots to the data folder
- Add [--sentiment-backend transformers] and [--transformers-model <name-or-path>] to use a local HF model instead of VADER
- Add [--export-transformers-details] to include `sentiment_label` and `sentiment_probs` in augmented/combined CSVs
- Add [--sentiment-backend gpt] and optionally [--gpt-model MODEL] [--gpt-base-url URL] [--gpt-batch-size K] to use a local GPT (Ollama) backend
- Plot sizing and label controls (daily charts):
  - [--plot-width-scale FLOAT] [--plot-max-width INCHES] [--plot-height INCHES]
  - [--activity-top-n N]
  - [--labels-max-per-day N] [--labels-per-line N] [--labels-band-y FLOAT] [--labels-stagger-rows N] [--labels-annotate-mode ticks|all|ticks+top]
When fixtures are provided (`--fixtures-csv`):
- The report adds a "## Matchday cross-analysis" section with on vs off matchday tables.
- Plots include:
  - daily_activity_stacked.png with match labels inside the chart
  - daily_volume_and_sentiment.png (bars: volume; lines: pos%/neg%)
  - matchday_sentiment_overall.png (time series on fixture days)
  - matchday_posts_volume_vs_sentiment.png (scatter)
- The combined CSV (with `--write-combined-csv`) includes `is_matchday` and, for replies, `parent_is_matchday` when available.
- Replies are classified two ways: by reply date (`is_matchday` on the reply row) and by their parent post (`parent_is_matchday`). The latter better reflects matchday-driven engagement.
Emoji and plot examples:
```zsh
# Keep emojis (default) and boost for strong positive/negative emojis
./run_scraper.sh analyze -i data/messages.csv --emoji-mode keep --emoji-boost --save-plots
# Demojize to :smiling_face: tokens (helps some tokenizers), with boost
./run_scraper.sh analyze -i data/messages.csv --emoji-mode demojize --emoji-boost
# Strip emojis entirely (if they add noise)
./run_scraper.sh analyze -i data/messages.csv --emoji-mode strip --save-plots
# Use a transformers model for sentiment (will auto-download on first use unless a local path is provided).
# Tip: for an off-the-shelf sentiment head, try a fine-tuned model like SST-2:
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
--sentiment-backend transformers \
--transformers-model distilbert-base-uncased-finetuned-sst-2-english
```

## Local GPT backend (Ollama)
Use a local GPT model that returns JSON {label, confidence} per message; the analyzer maps this to a compound score and falls back to VADER on errors.
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
--sentiment-backend gpt \
--gpt-model llama3 \
--gpt-base-url http://localhost:11434 \
--write-augmented-csv --write-combined-csv --save-plots
```
---
## Train a local transformers sentiment model
Prepare a labeled CSV with at least two columns: `message` and `label` (e.g., neg/neu/pos or 0/1/2).
Don't have one yet? Create a labeling set from your existing posts/replies:
```zsh
# Generate a CSV to annotate by hand (adds a blank 'label' column)
./.venv/bin/python -m src.make_labeling_set \
--posts-csv data/premier_league_update.csv \
--replies-csv data/premier_league_replies.csv \
--sample-size 1000 \
-o data/labeled_sentiment.csv
# Or via alias (after sourcing scripts/aliases.zsh)
make_label_set "$POSTS_CSV" "$REPLIES_CSV" data/labeled_sentiment.csv 1000
```
Then fine-tune:
```zsh
# Ensure the venv exists (run any ./run_scraper.sh command once), then:
./.venv/bin/python -m src.train_sentiment \
--train-csv data/labeled_sentiment.csv \
--text-col message \
--label-col label \
--model-name distilbert-base-uncased \
--output-dir models/sentiment-distilbert \
--epochs 3 --batch-size 16
```
Use it in analyze:
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
--sentiment-backend transformers \
--transformers-model models/sentiment-distilbert
```
Export details (labels, probabilities) into CSVs:
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
--sentiment-backend transformers \
--transformers-model models/sentiment-distilbert \
--export-transformers-details \
--write-augmented-csv --write-combined-csv
```
Notes:
- The analyzer maps model class probabilities to a VADER-like compound score in [-1, 1] for compatibility with the rest of the report.
- If the model has id2label including 'neg','neu','pos' labels, the mapping is more accurate; otherwise it defaults to pos - neg.
- GPU/Apple Silicon (MPS) will be used automatically if available.
Torch install note (macOS):
- `requirements.txt` uses conditional pins: `torch==2.3.1` for Python < 3.13 and `torch>=2.7.1` for Python ≥ 3.13. This keeps installs smooth on macOS. If you hit install issues, let us know.
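The default `pos - neg` collapse mentioned in the notes is simple enough to show inline (illustrative; the real mapping also consults the model's id2label):

```python
def probs_to_compound(probs: dict[str, float]) -> float:
    """Collapse class probabilities into a VADER-like compound in [-1, 1]
    by taking pos - neg; the neutral mass cancels out."""
    return probs.get("pos", 0.0) - probs.get("neg", 0.0)

print(probs_to_compound({"neg": 0.25, "neu": 0.0, "pos": 0.75}))  # 0.5
print(probs_to_compound({"neg": 1.0, "neu": 0.0, "pos": 0.0}))    # -1.0
```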
## Evaluate a fine-tuned model
```zsh
./.venv/bin/python -m src.eval_sentiment \
--csv data/labeled_holdout.csv \
--text-col message \
--label-col label \
--model models/sentiment-distilbert
```
Prints accuracy, macro-precision/recall/F1, and a classification report.
## Fixtures (Premier League schedule via football-data.org)
Fetch fixtures between dates:
```zsh
./run_scraper.sh fixtures \
--start-date 2025-08-15 \
--end-date 2025-10-15 \
-o data/fixtures.csv
```
Notes:
- Requires `FOOTBALL_DATA_API_TOKEN` in `.env`.
- Output may be `.csv` or `.json` (by extension).
### All valid forms (fixtures)
- Base variables:
  - SD = start date YYYY-MM-DD
  - ED = end date YYYY-MM-DD
  - OUT = output .csv or .json
Form:
./run_scraper.sh fixtures --start-date SD --end-date ED -o OUT
---
## Advanced recipes
Parallel replies + forwards with separate sessions:
```zsh
# Terminal 1 replies
./run_scraper.sh replies \
-c https://t.me/SourceChannel \
--from-csv data/messages.csv \
-o data/replies.csv \
--min-replies 1 \
--concurrency 15 \
--resume \
--append \
--session-name "$HOME/.local/share/telethon_sessions/telegram_replies"
# Terminal 2 forwards
./run_scraper.sh forwards \
-c https://t.me/SourceChannel \
--from-csv data/messages.csv \
-o data/forwards.csv \
--scan-limit 20000 \
--concurrency 10 \
--chunk-size 1500 \
--session-name "$HOME/.local/share/telethon_sessions/telegram_forwards"
```
Tuning for rate limits:
- If `[rate-limit]` logs are frequent, reduce `--concurrency` (step it down by 3 to 5) and keep `--chunk-size` around 1000-2000.
- For replies, prioritize with `--min-replies 1` to avoid parents with zero replies.
Safety:
- Use `--append` with replies and `--resume` to avoid truncating and to dedupe.
- Forwards and scrape don't dedupe; prefer writing to a new file, or dedupe afterwards.
---
## Environment setup quick-start
Create `.env` (script will prompt if missing):
```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash
# Optional defaults
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```
First run will prompt for phone and code (and 2FA if enabled).
---
## Troubleshooting
- Empty replies file
- Ensure `-c` matches the channel in your posts CSV URLs.
- Use `--append` so the file isn't truncated before writing.
- “database is locked”
- Use unique `--session-name` per parallel process; store sessions outside iCloud Drive.
- Forwards empty
- Same-channel forwards are rare. This tool only finds self-forwards (not cross-channel).
- Analyze errors
- Ensure CSVs have expected columns. Posts: `id,date,message,...`; Replies: `parent_id,id,date,message,...`.
- Exit code 1 when starting
- Check the last log lines. Common causes: missing TELEGRAM_API_ID/HASH in `.env`, wrong channel handle vs CSV URLs, session file locked by another process (use distinct `--session-name`), or a bad output path.
---
## Quick aliases for daily runs (zsh) ⚡
Paste this section into your current shell or your `~/.zshrc` to get convenient Make-like commands.
### Project defaults (edit as needed)
```zsh
# Channel and files
export CH="https://t.me/Premier_League_Update"
export POSTS_CSV="data/premier_league_update.csv"
export REPLIES_CSV="data/premier_league_replies.csv"
export FORWARDS_CSV="data/premier_league_forwards.csv"
export TAGS_CFG="config/tags.yaml"
export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"
# Sessions directory outside iCloud (avoid sqlite locks)
export SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"
```
### Aliases (zsh functions)
```zsh
# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
fast_replies() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  local sess="${5:-$SESSION_DIR/telegram_replies}"
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --min-replies 1 \
    --concurrency "$conc" \
    --resume \
    --append \
    --session-name "$sess"
}

# Chunked forwards: concurrent chunk scan with progress logs
chunked_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$FORWARDS_CSV}"
  local scan="${4:-20000}"
  local conc="${5:-10}"
  local chunk="${6:-1500}"
  local sess="${7:-$SESSION_DIR/telegram_forwards}"
  ./run_scraper.sh forwards \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --scan-limit "$scan" \
    --concurrency "$conc" \
    --chunk-size "$chunk" \
    --append \
    --session-name "$sess"
}

# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
analyze_combined() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv
}

# Emoji-aware analyze with sensible defaults (keep + boost)
analyze_emoji() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local mode="${6:-keep}" # keep | demojize | strip
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --emoji-mode "$mode" \
    --emoji-boost
}

# One-shot daily pipeline: fast replies then combined analyze
run_daily() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}

# One-shot daily pipeline with forwards in parallel
run_daily_with_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local forwards="${4:-$FORWARDS_CSV}"
  local rep_conc="${5:-15}"
  local f_scan="${6:-20000}"
  local f_conc="${7:-10}"
  local f_chunk="${8:-1500}"
  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"
  # Launch replies and forwards in parallel with separate sessions
  local pid_r pid_f
  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!
  # Wait for completion, then analyze with emoji handling
  wait $pid_r
  wait $pid_f
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}
```
### Usage
```zsh
# Use project defaults
fast_replies
chunked_forwards
analyze_combined
# Override on the fly (channel, files, or tuning)
fast_replies "https://t.me/AnotherChannel" data/other_posts.csv data/other_replies.csv 12
chunked_forwards "$CH" "$POSTS_CSV" data/alt_forwards.csv 30000 12 2000
analyze_combined data/other_posts.csv data/other_replies.csv "$TAGS_CFG" "$FIXTURES_CSV" "$CH"
```