# Project command reference
This file lists all supported commands and practical permutations for `./run_scraper.sh`, with short comments and tips. It mirrors the actual CLI flags in the code.
- Shell: zsh (macOS) — commands below are ready to paste.
- Env: a `.venv` is created automatically; dependencies are installed from `requirements.txt`.
- Secrets: create `.env` with TELEGRAM_API_ID and TELEGRAM_API_HASH; for fixtures, also set FOOTBALL_DATA_API_TOKEN.
- 2FA: if you use Telegram two-step verification, set TELEGRAM_2FA_PASSWORD in `.env` (the shell wrapper doesn't accept a flag for this).
- Sessions: Telethon uses a SQLite session file (default `telegram.session`). When running multiple tools in parallel, use distinct `--session-name` values.
## Common conventions
- Channels
  - Use either handle or URL: `-c @name` or `-c https://t.me/name`.
  - For replies, the channel must match the posts' source in your CSV `url` column.
- Output behavior
  - scrape/replies/forwards overwrite unless you pass `--append`.
  - analyze always overwrites its outputs.
- Rate limits
  - Replies/forwards log `[rate-limit]` if Telegram asks you to wait. Reduce `--concurrency` if this happens frequently.
- Parallel runs
  - Add `--session-name <unique>` per process to avoid "database is locked". Prefer sessions outside iCloud Drive.
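One way to mint a distinct session per process is to key it on the shell PID — a sketch; the sessions directory shown here is just a suggestion:

```shell
# Keep Telethon session files outside iCloud Drive to avoid SQLite locking issues.
SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"

# $$ is the current shell's PID, so concurrent runs never collide on a session file.
SESSION_NAME="$SESSION_DIR/telegram_$$"
echo "Using session: $SESSION_NAME"
```

Then pass `--session-name "$SESSION_NAME"` on each command you launch from that shell.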
## Scrape (posts/messages)
Minimal (overwrite output):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv
```
With date range and limit:
```zsh
./run_scraper.sh scrape \
  -c https://t.me/SomeChannel \
  -o data/messages.jsonl \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --limit 500
```
Legacy offset date (deprecated; prefer --start-date):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --offset-date 2025-01-01
```
Append to existing file and pass phone on first login:
```zsh
./run_scraper.sh scrape \
  -c @SomeChannel \
  -o data/messages.csv \
  --append \
  --phone +15551234567
```
Use a custom session (useful in parallel):
```zsh
./run_scraper.sh scrape -c @SomeChannel -o data/messages.csv --session-name telegram_scrape
```
Notes:
- Output format is inferred from the extension: `.csv` or `.jsonl`/`.ndjson`.
- Two-step verification: set TELEGRAM_2FA_PASSWORD in `.env` (no CLI flag in the shell wrapper).
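Because the writer keys off the extension, a tiny guard (a hypothetical helper, not part of the CLI) can catch a typo'd output path before a long run:

```shell
# Return 0 when the output path has a supported extension, 1 otherwise.
check_out_ext() {
  case "$1" in
    *.csv|*.jsonl|*.ndjson) return 0 ;;
    *) echo "warning: '$1' is not .csv/.jsonl/.ndjson" >&2; return 1 ;;
  esac
}

check_out_ext data/messages.csv && echo "extension ok"
```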
### All valid forms (scrape)
Use one of the following combinations. Replace placeholders with your values.
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv or .jsonl
  - Optional value flags: [--limit N] [--session-name NAME] [--phone NUMBER]
- Date filter permutations (4) × Append flag (2) × Limit presence (2) = 16 forms
- No dates, no append, no limit: `./run_scraper.sh scrape -c CH -o OUT`
- No dates, no append, with limit: `./run_scraper.sh scrape -c CH -o OUT --limit N`
- No dates, with append, no limit: `./run_scraper.sh scrape -c CH -o OUT --append`
- No dates, with append, with limit: `./run_scraper.sh scrape -c CH -o OUT --append --limit N`
- Start only, no append, no limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD`
- Start only, no append, with limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --limit N`
- Start only, with append, no limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append`
- Start only, with append, with limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --append --limit N`
- End only, no append, no limit: `./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD`
- End only, no append, with limit: `./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --limit N`
- End only, with append, no limit: `./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append`
- End only, with append, with limit: `./run_scraper.sh scrape -c CH -o OUT --end-date YYYY-MM-DD --append --limit N`
- Start and end, no append, no limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD`
- Start and end, no append, with limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --limit N`
- Start and end, with append, no limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append`
- Start and end, with append, with limit: `./run_scraper.sh scrape -c CH -o OUT --start-date YYYY-MM-DD --end-date YYYY-MM-DD --append --limit N`
Optional add-ons valid for any form above:
- Append [--session-name NAME] and/or [--phone NUMBER]
- Deprecated alternative to start-date: add [--offset-date YYYY-MM-DD]
## Replies (fetch replies to posts)
From a posts CSV (fast path; skips posts with 0 replies in the CSV):
```zsh
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append
```
Using explicit message IDs:
```zsh
./run_scraper.sh replies \
  -c @SourceChannel \
  --ids "123,456,789" \
  -o data/replies.csv \
  --concurrency 5 \
  --append
```
IDs from a file (one per line) using zsh command substitution:
```zsh
IDS=$(tr '\n' ',' < parent_ids.txt | sed 's/,$//')
./run_scraper.sh replies -c @SourceChannel --ids "$IDS" -o data/replies.csv --concurrency 8 --append
```
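The same join can also be done with `sort` and `paste`, which dedupes repeated IDs along the way; a sketch using an inline example file:

```shell
# Build the example file (your real parent_ids.txt would come from elsewhere).
printf '%s\n' 456 123 456 789 > parent_ids.txt

# Numeric sort, drop duplicates, join with commas.
IDS=$(sort -nu parent_ids.txt | paste -sd, -)
echo "$IDS"   # 123,456,789
```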
Parallel-safe session name:
```zsh
./run_scraper.sh replies -c @SourceChannel --from-csv data/messages.csv -o data/replies.csv --concurrency 12 --resume --append --session-name telegram_replies
```
What the flags do:
- `--from-csv PATH` reads parent IDs from a CSV with an `id` column (optionally filtered by `--min-replies`).
- `--ids` provides a comma-separated list of parent IDs.
- `--concurrency K` processes K parent IDs in parallel (default 5).
- `--resume` dedupes against `(parent_id, id)` pairs already present in the output.
- `--append` appends to the output instead of overwriting.
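`--resume` handles dedupe at fetch time; if you ever need to clean an existing file after the fact, a minimal awk sketch works, assuming `parent_id` and `id` are the first two columns and contain no quoted commas:

```shell
# Demo input with one duplicated (parent_id,id) pair.
cat > replies_demo.csv <<'EOF'
parent_id,id,date,message
10,100,2025-01-01,hi
10,100,2025-01-01,hi
10,101,2025-01-01,yo
EOF

# Keep the header (NR==1) and the first occurrence of each (parent_id,id) pair.
awk -F, 'NR==1 || !seen[$1","$2]++' replies_demo.csv > replies_dedup.csv
```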
Notes:
- The channel (`-c`) must match the posts' source in your CSV URLs (the tool warns on mismatch).
- First login may require `--phone` (interactive prompt). For 2FA, set TELEGRAM_2FA_PASSWORD in `.env`.
### All valid forms (replies)
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
  - Source: exactly one of S1 or S2
    - S1: --ids "id1,id2,..."
    - S2: --from-csv PATH [--min-replies N]
  - Optional: [--concurrency K] [--session-name NAME] [--phone NUMBER]
  - Binary: [--append], [--resume]
- Enumerated binary permutations for each source (4 per source = 8 total):
  - S1, no append, no resume: `./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT`
  - S1, no append, resume: `./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --resume`
  - S1, append, no resume: `./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append`
  - S1, append, resume: `./run_scraper.sh replies -c CH --ids "IDLIST" -o OUT --append --resume`
  - S2, no append, no resume: `./run_scraper.sh replies -c CH --from-csv PATH -o OUT`
  - S2, no append, resume: `./run_scraper.sh replies -c CH --from-csv PATH -o OUT --resume`
  - S2, append, no resume: `./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append`
  - S2, append, resume: `./run_scraper.sh replies -c CH --from-csv PATH -o OUT --append --resume`
Optional add-ons valid for any form above:
- Add [--concurrency K] to tune speed; recommended 8–20
- With S2 you may add [--min-replies N] to prioritize parents with replies
- Add [--session-name NAME] and/or [--phone NUMBER]
## Forwards (same-channel forwards referencing posts)
Typical concurrent scan (best-effort; often zero results):
```zsh
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500
```
With date filters (applied to scanned messages):
```zsh
./run_scraper.sh forwards \
  -c @SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --start-date 2025-01-01 \
  --end-date 2025-03-31 \
  --scan-limit 10000 \
  --concurrency 8 \
  --chunk-size 1000
```
Using explicit message IDs:
```zsh
./run_scraper.sh forwards -c @SourceChannel --ids "100,200,300" -o data/forwards.csv --scan-limit 8000 --concurrency 6 --chunk-size 1000
```
Sequential mode (no chunking) by omitting --scan-limit:
```zsh
./run_scraper.sh forwards -c @SourceChannel --from-csv data/messages.csv -o data/forwards.csv
```
What the flags do:
- `--scan-limit N`: enables chunked, concurrent scanning of roughly the N most recent message IDs.
- `--concurrency K`: number of ID chunks to scan in parallel (requires `--scan-limit`).
- `--chunk-size M`: approximate IDs per chunk (a trade-off between balance and overhead). Start with 1000–2000.
- `--append`: append instead of overwrite.
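A back-of-envelope for tuning: the scan splits into about ⌈scan-limit / chunk-size⌉ chunks, and `--concurrency` caps how many are in flight at once. For the typical example above:

```shell
scan_limit=20000
chunk_size=1500
concurrency=10

# Ceiling division: number of chunks the scan splits into.
chunks=$(( (scan_limit + chunk_size - 1) / chunk_size ))
echo "$chunks chunks total, at most $concurrency scanning in parallel"
```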
Notes:
- This only finds forwards within the same channel that reference your parent IDs (self-forwards). Many channels will yield zero.
- Global cross-channel forward discovery is not supported here (can be added as a separate mode).
- Without `--scan-limit`, the tool scans sequentially from newest backwards and logs progress every ~1000 messages.
### All valid forms (forwards)
- Base variables:
  - CH = @handle or https://t.me/handle
  - OUT = path to .csv
  - Source: exactly one of S1 or S2
    - S1: --ids "id1,id2,..."
    - S2: --from-csv PATH
  - Modes:
    - M1: sequential scan (omit --scan-limit)
    - M2: chunked concurrent scan (requires --scan-limit N; accepts --concurrency K and --chunk-size M)
  - Optional date filters for both modes: [--start-date D] [--end-date D]
  - Binary: [--append]
  - Optional: [--session-name NAME] [--phone NUMBER]
- Enumerated permutations by mode, source, and append (2 modes × 2 sources × 2 append = 8 forms):
  - M1, S1, no append: `./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT [--start-date D] [--end-date D]`
  - M1, S1, append: `./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --append [--start-date D] [--end-date D]`
  - M1, S2, no append: `./run_scraper.sh forwards -c CH --from-csv PATH -o OUT [--start-date D] [--end-date D]`
  - M1, S2, append: `./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --append [--start-date D] [--end-date D]`
  - M2, S1, no append: `./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]`
  - M2, S1, append: `./run_scraper.sh forwards -c CH --ids "IDLIST" -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]`
  - M2, S2, no append: `./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]`
  - M2, S2, append: `./run_scraper.sh forwards -c CH --from-csv PATH -o OUT --scan-limit N --append [--concurrency K] [--chunk-size M] [--start-date D] [--end-date D]`
Optional add-ons valid for any form above:
- Add [--session-name NAME] and/or [--phone NUMBER]
## Analyze (reports and tagging)
Posts-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv
```
Outputs:
- `data/messages_report.md`
- `data/messages_tagged.csv`
Replies-only report + tagged CSV:
```zsh
./run_scraper.sh analyze \
  -i data/replies.csv \
  --channel "Replies - @SourceChannel" \
  --tags-config config/tags.yaml \
  --write-augmented-csv
```
Outputs:
- `data/replies_report.md`
- `data/replies_tagged.csv`
Combined (posts report augmented with replies):
```zsh
./run_scraper.sh analyze \
  -i data/messages.csv \
  --channel @SourceChannel \
  --tags-config config/tags.yaml \
  --replies-csv data/replies.csv \
  --fixtures-csv data/fixtures.csv \
  --write-augmented-csv \
  --write-combined-csv \
  --emoji-mode keep \
  --emoji-boost \
  --save-plots
```
Adds to posts dataset:
- `sentiment_compound` for posts (VADER)
- `replies_sentiment_mean` (average reply sentiment per post)
- `replies_count_scraped` and `replies_top_tags` (rollup from replies)
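To confirm the rollup columns landed in your augmented CSV, listing the header is enough; a sketch against a demo header (the real file is whatever the augmented-CSV step produced):

```shell
# Stand-in for data/messages_tagged.csv; only the header matters here.
printf 'id,date,message,sentiment_compound,replies_sentiment_mean,replies_count_scraped,replies_top_tags\n' > tagged_demo.csv

# One column name per line; count the reply rollup columns.
head -1 tagged_demo.csv | tr ',' '\n' | grep -c 'replies_'
```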
Report sections include:
- Summary, top posts by views/forwards/replies
- Temporal distributions
- Per-tag engagement
- Per-tag sentiment (posts)
- Replies per-tag summary
- Per-tag sentiment (replies)
- Combined sentiment (posts + replies)
- Matchday cross-analysis (when `--fixtures-csv` is provided):
  - Posts: on vs off matchdays (counts and sentiment shares)
  - Posts engagement vs matchday (replies per post: total, mean, median, share of posts with replies)
  - Replies: on vs off matchdays (counts and sentiment shares)
  - Replies by parent matchday and by reply date are both shown; parent-based classification is recommended for engagement.
Notes:
- Analyze overwrites its outputs; use `-o` to customize the report filename if needed.
- Emoji handling: add `--emoji-mode keep|demojize|strip` (default keep). Optionally add `--emoji-boost` to gently tilt scores when clearly positive/negative emojis are present.
- Add `--write-combined-csv` to emit a unified CSV of posts + replies with a `content_type` column.
### All valid forms (analyze)
- Base variables:
  - IN = input CSV (posts or replies)
  - Optional outputs/labels: [-o REPORT.md] [--channel @handle]
  - Optional configs/data: [--tags-config config/tags.yaml] [--replies-csv REPLIES.csv] [--fixtures-csv FIXTURES.csv]
  - Binary: [--write-augmented-csv]
- Core permutations across replies-csv, fixtures-csv, write-augmented-csv (2×2×2 = 8 forms):
- No replies, no fixtures, no aug: `./run_scraper.sh analyze -i IN`
- No replies, no fixtures, with aug: `./run_scraper.sh analyze -i IN --write-augmented-csv`
- No replies, with fixtures, no aug: `./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv`
- No replies, with fixtures, with aug: `./run_scraper.sh analyze -i IN --fixtures-csv FIXTURES.csv --write-augmented-csv`
- With replies, no fixtures, no aug: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv`
- With replies, no fixtures, with aug: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --write-augmented-csv`
- With replies, with fixtures, no aug: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv`
- With replies, with fixtures, with aug: `./run_scraper.sh analyze -i IN --replies-csv REPLIES.csv --fixtures-csv FIXTURES.csv --write-augmented-csv`
Optional add-ons valid for any form above:
- Append [-o REPORT.md] to control output filename
- Append [--channel @handle] for title
- Append [--tags-config config/tags.yaml] to enable tagging and per-tag summaries
- Append [--emoji-mode keep|demojize|strip] and optionally [--emoji-boost]
- Append [--write-combined-csv] to produce a merged posts+replies CSV
- Append [--save-plots] to emit plots to the data folder
- Append [--sentiment-backend transformers] and [--transformers-model MODEL] to use a local HF model instead of VADER
- Append [--export-transformers-details] to include `sentiment_label` and `sentiment_probs` in augmented/combined CSVs
- Append [--sentiment-backend gpt] and optionally [--gpt-model MODEL] [--gpt-base-url URL] [--gpt-batch-size K] to use a local GPT (Ollama) backend
- Plot sizing and label controls (daily charts):
- [--plot-width-scale FLOAT] [--plot-max-width INCHES] [--plot-height INCHES]
- [--activity-top-n N]
- [--labels-max-per-day N] [--labels-per-line N] [--labels-band-y FLOAT] [--labels-stagger-rows N] [--labels-annotate-mode ticks|all|ticks+top]
When fixtures are provided (--fixtures-csv):
- The report adds a "## Matchday cross-analysis" section with on vs off matchday tables.
- Plots include:
- daily_activity_stacked.png with match labels inside the chart
- daily_volume_and_sentiment.png (bars: volume; lines: pos%/neg%)
- matchday_sentiment_overall.png (time series on fixture days)
- matchday_posts_volume_vs_sentiment.png (scatter)
- The combined CSV (with `--write-combined-csv`) includes `is_matchday` and, for replies, `parent_is_matchday` when available.
- Replies are classified two ways: by reply date (`is_matchday` on the reply row) and by their parent post (`parent_is_matchday`). The latter better reflects matchday-driven engagement.
Emoji and plots examples:
```zsh
# Keep emojis (default) and boost for strong positive/negative emojis
./run_scraper.sh analyze -i data/messages.csv --emoji-mode keep --emoji-boost --save-plots

# Demojize to :smiling_face: tokens (helps some tokenizers), with boost
./run_scraper.sh analyze -i data/messages.csv --emoji-mode demojize --emoji-boost

# Strip emojis entirely (if they add noise)
./run_scraper.sh analyze -i data/messages.csv --emoji-mode strip --save-plots

# Use a transformers model for sentiment (auto-downloads on first use unless a local path is provided).
# Tip: for an off-the-shelf sentiment head, try a fine-tuned model like SST-2:
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model distilbert-base-uncased-finetuned-sst-2-english
```
## Local GPT backend (Ollama)
Use a local GPT model that returns JSON {label, confidence} per message; the analyzer maps this to a compound score and falls back to VADER on errors.
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
--sentiment-backend gpt \
--gpt-model llama3 \
--gpt-base-url http://localhost:11434 \
  --write-augmented-csv --write-combined-csv --save-plots
```

---
## Train a local transformers sentiment model
Prepare a labeled CSV with at least two columns: `message` and `label` (e.g., neg/neu/pos or 0/1/2).
Don’t have one yet? Create a labeling set from your existing posts/replies:
```zsh
# Generate a CSV to annotate by hand (adds a blank 'label' column)
./.venv/bin/python -m src.make_labeling_set \
--posts-csv data/premier_league_update.csv \
--replies-csv data/premier_league_replies.csv \
--sample-size 1000 \
-o data/labeled_sentiment.csv
# Or via alias (after sourcing scripts/aliases.zsh)
make_label_set "$POSTS_CSV" "$REPLIES_CSV" data/labeled_sentiment.csv 1000
```

Then fine-tune:

```zsh
# Ensure the venv exists (run any ./run_scraper.sh command once), then:
./.venv/bin/python -m src.train_sentiment \
  --train-csv data/labeled_sentiment.csv \
  --text-col message \
  --label-col label \
  --model-name distilbert-base-uncased \
  --output-dir models/sentiment-distilbert \
  --epochs 3 --batch-size 16
```
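Before training, it's worth sanity-checking the label column; a quick sketch assuming labels are neg/neu/pos in the second column, demonstrated on an inline demo file:

```shell
# Demo labeled CSV (stand-in for data/labeled_sentiment.csv).
cat > labeled_demo.csv <<'EOF'
message,label
great win,pos
meh,neu
awful ref,neg
EOF

# Count rows whose label is not one of the expected classes.
bad=$(awk -F, 'NR>1 && $2 !~ /^(neg|neu|pos)$/ {n++} END{print n+0}' labeled_demo.csv)
echo "rows with unexpected labels: $bad"
```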
Use it in analyze:
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert
```
Export details (labels, probabilities) into CSVs:
```zsh
./run_scraper.sh analyze -i data/messages.csv --replies-csv data/replies.csv \
  --sentiment-backend transformers \
  --transformers-model models/sentiment-distilbert \
  --export-transformers-details \
  --write-augmented-csv --write-combined-csv
```
Notes:
- The analyzer maps model class probabilities to a VADER-like compound score in [-1, 1] for compatibility with the rest of the report.
- If the model has id2label including 'neg','neu','pos' labels, the mapping is more accurate; otherwise it defaults to pos - neg.
- GPU/Apple Silicon (MPS) will be used automatically if available.
Torch install note (macOS): `requirements.txt` uses conditional pins: `torch==2.3.1` for Python < 3.13 and `torch>=2.7.1` for Python ≥ 3.13. This keeps installs smooth on macOS. If you hit install issues, let us know.
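In `requirements.txt` syntax, pins like these are conditional via environment markers; a sketch mirroring the note above (not a verbatim copy of the project file):

```
torch==2.3.1; python_version < "3.13"
torch>=2.7.1; python_version >= "3.13"
```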
### Evaluate a fine-tuned model
```zsh
./.venv/bin/python -m src.eval_sentiment \
  --csv data/labeled_holdout.csv \
  --text-col message \
  --label-col label \
  --model models/sentiment-distilbert
```
Prints accuracy, macro-precision/recall/F1, and a classification report.
## Fixtures (Premier League schedule via football-data.org)
Fetch fixtures between dates:
```zsh
./run_scraper.sh fixtures \
  --start-date 2025-08-15 \
  --end-date 2025-10-15 \
  -o data/fixtures.csv
```
Notes:
- Requires `FOOTBALL_DATA_API_TOKEN` in `.env`.
- Output may be `.csv` or `.json` (chosen by extension).
### All valid forms (fixtures)
- Base variables:
- SD = start date YYYY-MM-DD
- ED = end date YYYY-MM-DD
- OUT = output .csv or .json
Form: `./run_scraper.sh fixtures --start-date SD --end-date ED -o OUT`
## Advanced recipes
Parallel replies + forwards with separate sessions:
```zsh
# Terminal 1 – replies
./run_scraper.sh replies \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/replies.csv \
  --min-replies 1 \
  --concurrency 15 \
  --resume \
  --append \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_replies"
```
```zsh
# Terminal 2 – forwards
./run_scraper.sh forwards \
  -c https://t.me/SourceChannel \
  --from-csv data/messages.csv \
  -o data/forwards.csv \
  --scan-limit 20000 \
  --concurrency 10 \
  --chunk-size 1500 \
  --session-name "$HOME/.local/share/telethon_sessions/telegram_forwards"
```
Tuning for rate limits:
- If `[rate-limit]` logs are frequent, reduce `--concurrency` (drop it by 3–5) and keep `--chunk-size` around 1000–2000.
- For replies, prioritize with `--min-replies 1` to avoid parents with zero replies.
Safety:
- Use `--append` with replies plus `--resume` to avoid truncating and to dedupe.
- Forwards and scrape don't dedupe; prefer writing to a new file or dedupe afterwards.
## Environment setup quick-start
Create `.env` (the script will prompt if it's missing):
```
TELEGRAM_API_ID=123456
TELEGRAM_API_HASH=your_api_hash

# Optional defaults
TELEGRAM_SESSION_NAME=telegram
TELEGRAM_2FA_PASSWORD=your_2fa_password
FOOTBALL_DATA_API_TOKEN=your_token
```
First run will prompt for phone and code (and 2FA if enabled).
## Troubleshooting
- Empty replies file
  - Ensure `-c` matches the channel in your posts CSV URLs.
  - Use `--append` so the file isn't truncated before writing.
- "database is locked"
  - Use a unique `--session-name` per parallel process; store sessions outside iCloud Drive.
- Forwards empty
  - Same-channel forwards are rare. This tool only finds self-forwards (not cross-channel).
- Analyze errors
  - Ensure CSVs have the expected columns. Posts: `id,date,message,...`; replies: `parent_id,id,date,message,...`.
- Exit code 1 when starting
  - Check the last log lines. Common causes: missing TELEGRAM_API_ID/HASH in `.env`, a channel handle that doesn't match your CSV URLs, a session file locked by another process (use a distinct `--session-name`), or a bad output path.
## Quick aliases for daily runs (zsh) ⚡
Paste this section into your current shell or your ~/.zshrc to get convenient Make-like commands.
### Project defaults (edit as needed)
```zsh
# Channel and files
export CH="https://t.me/Premier_League_Update"
export POSTS_CSV="data/premier_league_update.csv"
export REPLIES_CSV="data/premier_league_replies.csv"
export FORWARDS_CSV="data/premier_league_forwards.csv"
export TAGS_CFG="config/tags.yaml"
export FIXTURES_CSV="data/premier_league_schedule_2025-08-15_to_2025-10-15.csv"

# Sessions directory outside iCloud (avoid sqlite locks)
export SESSION_DIR="$HOME/.local/share/telethon_sessions"
mkdir -p "$SESSION_DIR"
```
### Aliases (zsh functions)
```zsh
# Fast replies: resume+append, prioritizes parents with replies, tuned concurrency
fast_replies() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  local sess="${5:-$SESSION_DIR/telegram_replies}"
  ./run_scraper.sh replies \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --min-replies 1 \
    --concurrency "$conc" \
    --resume \
    --append \
    --session-name "$sess"
}
```
```zsh
# Chunked forwards: concurrent chunk scan with progress logs
chunked_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local out="${3:-$FORWARDS_CSV}"
  local scan="${4:-20000}"
  local conc="${5:-10}"
  local chunk="${6:-1500}"
  local sess="${7:-$SESSION_DIR/telegram_forwards}"
  ./run_scraper.sh forwards \
    -c "$ch" \
    --from-csv "$posts" \
    -o "$out" \
    --scan-limit "$scan" \
    --concurrency "$conc" \
    --chunk-size "$chunk" \
    --append \
    --session-name "$sess"
}
```
```zsh
# Combined analyze: posts + replies + fixtures with tags; writes augmented CSVs
analyze_combined() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv
}
```
```zsh
# Emoji-aware analyze with sensible defaults (keep + boost)
analyze_emoji() {
  local posts="${1:-$POSTS_CSV}"
  local replies="${2:-$REPLIES_CSV}"
  local tags="${3:-$TAGS_CFG}"
  local fixtures="${4:-$FIXTURES_CSV}"
  local ch="${5:-$CH}"
  local mode="${6:-keep}" # keep | demojize | strip
  ./run_scraper.sh analyze \
    -i "$posts" \
    --channel "$ch" \
    --tags-config "$tags" \
    --replies-csv "$replies" \
    --fixtures-csv "$fixtures" \
    --write-augmented-csv \
    --write-combined-csv \
    --emoji-mode "$mode" \
    --emoji-boost
}
```
```zsh
# One-shot daily pipeline: fast replies then combined analyze
run_daily() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local conc="${4:-15}"
  fast_replies "$ch" "$posts" "$replies" "$conc" "$SESSION_DIR/telegram_replies"
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}
```
```zsh
# One-shot daily pipeline with forwards in parallel
run_daily_with_forwards() {
  local ch="${1:-$CH}"
  local posts="${2:-$POSTS_CSV}"
  local replies="${3:-$REPLIES_CSV}"
  local forwards="${4:-$FORWARDS_CSV}"
  local rep_conc="${5:-15}"
  local f_scan="${6:-20000}"
  local f_conc="${7:-10}"
  local f_chunk="${8:-1500}"
  local sess_r="${9:-$SESSION_DIR/telegram_replies}"
  local sess_f="${10:-$SESSION_DIR/telegram_forwards}"
  # Launch replies and forwards in parallel with separate sessions
  local pid_r pid_f
  fast_replies "$ch" "$posts" "$replies" "$rep_conc" "$sess_r" & pid_r=$!
  chunked_forwards "$ch" "$posts" "$forwards" "$f_scan" "$f_conc" "$f_chunk" "$sess_f" & pid_f=$!
  # Wait for completion and then analyze with emoji handling
  wait $pid_r
  wait $pid_f
  analyze_emoji "$posts" "$replies" "$TAGS_CFG" "$FIXTURES_CSV" "$ch" keep
}
```
### Usage
```zsh
# Use project defaults
fast_replies
chunked_forwards
analyze_combined

# Override on the fly (channel, files, or tuning)
fast_replies "https://t.me/AnotherChannel" data/other_posts.csv data/other_replies.csv 12
chunked_forwards "$CH" "$POSTS_CSV" data/alt_forwards.csv 30000 12 2000
analyze_combined data/other_posts.csv data/other_replies.csv "$TAGS_CFG" "$FIXTURES_CSV" "$CH"
```