feat: add runner conversion scripts and strengthen cutover automation

2026-03-04 13:32:06 -06:00
parent e624885bb9
commit c2087d5087
43 changed files with 6995 additions and 42 deletions
--- a/runners-conversion/augur/README.md
+++ b/runners-conversion/augur/README.md
@@ -0,0 +1,416 @@
+# Self-Hosted GitHub Actions Runner (Docker)
+
+Run GitHub Actions CI on your own Linux server instead of GitHub-hosted runners.
+Self-hosting takes CI load off your laptop, avoids GitHub-hosted runner-minute quotas, and can shorten feedback time.
+
+## How It Works
+
+Each runner container:
+1. Starts up and uses your GitHub PAT to request a short-lived registration token
+2. Registers with GitHub in **ephemeral mode** (one job per lifecycle)
+3. Picks up a CI job, executes it, and exits
+4. Docker's `restart: unless-stopped` brings it back for the next job
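+
+The lifecycle above could be sketched in shell roughly as follows. This is an illustrative outline, not the repository's actual `entrypoint.sh`: the function names are hypothetical, and it assumes `curl`, `jq`, and the runner agent's `config.sh` and `run.sh` are available in the working directory.
+
+```bash
+# Hypothetical helper: map https://github.com/OWNER/REPO to the REST endpoint
+# that issues short-lived runner registration tokens.
+registration_token_url() {
+  local repo_path="${1#https://github.com/}"
+  repo_path="${repo_path%/}"
+  repo_path="${repo_path%.git}"
+  printf 'https://api.github.com/repos/%s/actions/runners/registration-token\n' "$repo_path"
+}
+
+# Hypothetical helper: exchange the PAT for a registration token, register in
+# ephemeral mode, then hand control to the runner agent.
+register_and_run() {
+  local response token
+  response="$(curl -fsSL -X POST \
+    -H "Authorization: Bearer ${GITHUB_PAT}" \
+    -H "Accept: application/vnd.github+json" \
+    "$(registration_token_url "$REPO_URL")")" || return 1
+  token="$(printf '%s' "$response" | jq -r '.token')"
+
+  ./config.sh --unattended --ephemeral \
+    --url "$REPO_URL" --token "$token" \
+    --name "$RUNNER_NAME" \
+    --labels "${RUNNER_LABELS:-self-hosted,Linux,X64}"
+
+  # The agent exits after one job; Docker's restart policy starts a new cycle.
+  exec ./run.sh
+}
+```
+
+An entrypoint built this way would call `register_and_run` as its last step, so only the registration token, not the PAT, is handed to `config.sh`.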
+
+## Prerequisites
+
+- Docker Engine 24+ and Docker Compose v2
+- A GitHub Personal Access Token (classic) with **`repo`** and **`read:packages`** scopes
+- Network access to `github.com`, `api.github.com`, and `ghcr.io`
+
+## One-Time GitHub Setup
+
+Before deploying, the repository needs write permissions for the image build workflow.
+
+### Enable GHCR image builds
+
+The `build-runner-image.yml` workflow pushes Docker images to GHCR using the
+`GITHUB_TOKEN`. By default, this token is read-only and the workflow will fail
+silently (zero steps executed, no runner assigned).
+
+Fix by allowing write permissions for Actions workflows:
+
+```bash
+gh api -X PUT repos/OWNER/REPO/actions/permissions/workflow \
+  -f default_workflow_permissions=write \
+  -F can_approve_pull_request_reviews=false
+```
+
+Alternatively, keep read-only defaults and create a dedicated PAT secret with
+`write:packages` scope, then reference it in the workflow instead of `GITHUB_TOKEN`.
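+
+To confirm which default is currently in effect, the same endpoint can be read back with a `GET`. The helper names below are hypothetical; the sketch assumes an authenticated `gh` CLI.
+
+```bash
+# Hypothetical helper: print the repository's default workflow permission
+# ("read" or "write"). Takes OWNER/REPO as its only argument.
+default_workflow_permissions() {
+  gh api "repos/$1/actions/permissions/workflow" --jq '.default_workflow_permissions'
+}
+
+# Hypothetical helper: succeed only if the default permission is "write".
+workflow_can_write() {
+  [ "$(default_workflow_permissions "$1")" = "write" ]
+}
+```
+
+For example, `workflow_can_write OWNER/REPO || echo "GITHUB_TOKEN is read-only"` gives a quick check before triggering the image build.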
+
+### Build the runner image
+
+Trigger the GHCR image build (first time and whenever Dockerfile/entrypoint changes):
+
+```bash
+gh workflow run build-runner-image.yml
+```
+
+Check the status of the latest run (the build takes about 5 minutes):
+
+```bash
+gh run list --workflow=build-runner-image.yml --limit=1
+```
+
+The image is also rebuilt automatically:
+- On push to `main` when `infra/runners/Dockerfile` or `entrypoint.sh` changes
+- Weekly (Monday 06:00 UTC) to pick up OS patches and runner agent updates
+
+## Deploy on Your Server
+
+### Choose an image source
+
+| Method | Files needed on server | Registry auth? | Best for |
+|--------|----------------------|----------------|----------|
+| **Self-hosted registry** | `docker-compose.yml`, `.env`, `envs/augur.env` | No (your network) | Production — push once, pull from any machine |
+| **GHCR** | `docker-compose.yml`, `.env`, `envs/augur.env` | Yes (`docker login ghcr.io`) | GitHub-native workflow |
+| **Build locally** | All 5 files (+ `Dockerfile`, `entrypoint.sh`) | No | Quick start, no registry needed |
+
+### Option A: Self-hosted registry (recommended)
+
+For the full end-to-end workflow (build image on your Mac, push to Unraid registry,
+start runner), see the [CI Workflow Guide](../../docs/ci-workflows.md#lifecycle-2-offload-ci-to-a-server-unraid).
+
+The private Docker registry is configured at `infra/registry/`. It listens on port 5000
+and is reachable from the LAN. Docker allows plain-HTTP connections to registries on
+`localhost` by default, so no `daemon.json` changes are needed on the server itself.
+To push from another machine, add `<UNRAID_IP>:5000` to `insecure-registries` in that
+machine's Docker daemon config.
+
+### Option B: GHCR
+
+Requires the `build-runner-image.yml` workflow to have run successfully
+(see [One-Time GitHub Setup](#one-time-github-setup)).
+
+```bash
+# 1. Copy environment templates
+cp .env.example .env
+cp envs/augur.env.example envs/augur.env
+
+# 2. Edit .env — set your GITHUB_PAT
+# 3. Edit envs/augur.env — set REPO_URL, RUNNER_NAME, resource limits
+
+# 4. Authenticate Docker with GHCR (one-time, persists to ~/.docker/config.json)
+echo "$GITHUB_PAT" | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
+
+# 5. Pull and start
+docker compose pull
+docker compose up -d
+
+# 6. Verify runner is registered
+docker compose ps
+docker compose logs -f runner-augur
+```
+
+### Option C: Build locally
+
+No registry needed — builds the image directly on the target machine.
+Requires `Dockerfile` and `entrypoint.sh` alongside the compose file.
+
+```bash
+# 1. Copy environment templates
+cp .env.example .env
+cp envs/augur.env.example envs/augur.env
+
+# 2. Edit .env — set your GITHUB_PAT
+# 3. Edit envs/augur.env — set REPO_URL, RUNNER_NAME, resource limits
+
+# 4. Build and start
+docker compose up -d --build
+
+# 5. Verify runner is registered
+docker compose ps
+docker compose logs -f runner-augur
+```
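+
+Steps 2 and 3 in Options B and C are easy to skip by accident. A small pre-flight check along these lines could catch an unset value before the container starts. The helper name is hypothetical, and it only detects variables that are absent or empty, not values that are wrong.
+
+```bash
+# Hypothetical helper: print each required variable that is missing or empty
+# in an env file. Usage: missing_vars FILE VAR...
+missing_vars() {
+  local file="$1" var
+  shift
+  for var in "$@"; do
+    grep -Eq "^${var}=.+" "$file" || echo "$var"
+  done
+}
+```
+
+For example, `missing_vars .env GITHUB_PAT` and `missing_vars envs/augur.env REPO_URL RUNNER_NAME` print nothing when the files are filled in.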
+
+### Verify the runner is online in GitHub
+
+```bash
+gh api repos/OWNER/REPO/actions/runners \
+  --jq '.runners[] | {name, status, labels: [.labels[].name]}'
+```
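+
+Registration can take a few seconds after the container starts, so a single query may run too early. A polling loop such as the following sketch could wrap the check above. Both helper names are hypothetical, and an authenticated `gh` CLI is assumed.
+
+```bash
+# Hypothetical helper: retry a command up to ATTEMPTS times, sleeping DELAY
+# seconds after each failure. Usage: wait_until ATTEMPTS DELAY CMD...
+wait_until() {
+  local attempts="$1" delay="$2" i=1
+  shift 2
+  while [ "$i" -le "$attempts" ]; do
+    "$@" && return 0
+    sleep "$delay"
+    i=$((i + 1))
+  done
+  return 1
+}
+
+# Hypothetical helper: succeed if the named runner is reported as online.
+runner_online() {
+  local repo="$1" name="$2"
+  gh api "repos/${repo}/actions/runners" \
+    --jq ".runners[] | select(.name == \"${name}\") | .status" | grep -qx online
+}
+```
+
+For example, `wait_until 12 5 runner_online OWNER/REPO unraid-augur` polls for about a minute before giving up.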
+
+## Activate Self-Hosted CI
+
+Set the repository variable `CI_RUNS_ON` so the CI workflow targets your runner:
+
+```bash
+gh variable set CI_RUNS_ON --body '["self-hosted", "Linux", "X64"]'
+```
+
+To revert to GitHub-hosted runners:
+```bash
+gh variable delete CI_RUNS_ON
+```
+
+## Configuration
+
+### Shared Config (`.env`)
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `GITHUB_PAT` | Yes | GitHub PAT with `repo` + `read:packages` scope |
+
+### Per-Repo Config (`envs/<repo>.env`)
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `REPO_URL` | Yes | — | Full GitHub repository URL |
+| `RUNNER_NAME` | Yes | — | Unique runner name within the repo |
+| `RUNNER_LABELS` | No | `self-hosted,Linux,X64` | Comma-separated runner labels |
+| `RUNNER_GROUP` | No | `default` | Runner group |
+| `RUNNER_IMAGE` | No | `ghcr.io/aiinfuseds/augur-runner:latest` | Docker image to use |
+| `RUNNER_CPUS` | No | `6` | CPU limit for the container |
+| `RUNNER_MEMORY` | No | `12G` | Memory limit for the container |
+
+## Adding More Repos
+
+1. Copy the per-repo env template:
+   ```bash
+   cp envs/augur.env.example envs/myrepo.env
+   ```
+
+2. Edit `envs/myrepo.env` — set `REPO_URL`, `RUNNER_NAME`, and resource limits.
+
+3. Add a service block to `docker-compose.yml`:
+   ```yaml
+   runner-myrepo:
+     image: ${RUNNER_IMAGE:-ghcr.io/aiinfuseds/augur-runner:latest}
+     build: .
+     env_file:
+       - .env
+       - envs/myrepo.env
+     init: true
+     read_only: true
+     tmpfs:
+       - /tmp:size=2G
+     security_opt:
+       - no-new-privileges:true
+     stop_grace_period: 5m
+     deploy:
+       resources:
+         limits:
+           cpus: "${RUNNER_CPUS:-6}"
+           memory: "${RUNNER_MEMORY:-12G}"
+     restart: unless-stopped
+     healthcheck:
+       test: ["CMD", "pgrep", "-f", "Runner.Listener"]
+       interval: 30s
+       timeout: 5s
+       retries: 3
+       start_period: 30s
+     logging:
+       driver: json-file
+       options:
+         max-size: "50m"
+         max-file: "3"
+     volumes:
+       - myrepo-work:/home/runner/_work
+   ```
+
+4. Add the volume at the bottom of `docker-compose.yml`:
+   ```yaml
+   volumes:
+     augur-work:
+     myrepo-work:
+   ```
+
+5. Start: `docker compose up -d`
+
+## Scaling
+
+Run multiple concurrent runners for the same repo:
+
+```bash
+# Scale to 3 runners for augur
+docker compose up -d --scale runner-augur=3
+```
+
+Each container gets a unique runner name (Docker appends a suffix).
+Set `RUNNER_NAME` to a base name like `unraid-augur` — scaled instances become
+`unraid-augur-1`, `unraid-augur-2`, etc.
+
+## Resource Tuning
+
+Each repo can have different resource limits in its env file:
+
+```env
+# Lightweight repo (linting only)
+RUNNER_CPUS=2
+RUNNER_MEMORY=4G
+
+# Heavy repo (Go builds + extensive tests)
+RUNNER_CPUS=8
+RUNNER_MEMORY=16G
+```
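+
+After changing these values and running `docker compose up -d`, the limits Docker actually applied can be read back from the container. The sketch below uses hypothetical helper names and assumes Compose maps the `deploy.resources.limits` values onto the container's `HostConfig.NanoCpus` and `HostConfig.Memory` fields.
+
+```bash
+# Hypothetical helper: convert Docker's NanoCpus value to a CPU count.
+nanocpus_to_cpus() {
+  awk -v n="$1" 'BEGIN { printf "%.2f\n", n / 1000000000 }'
+}
+
+# Hypothetical helper: convert a byte count to GiB.
+bytes_to_gib() {
+  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
+}
+
+# Hypothetical helper: print the applied limits for each container of a
+# Compose service. Usage: show_limits SERVICE
+show_limits() {
+  local id limits
+  for id in $(docker compose ps -q "$1"); do
+    limits="$(docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' "$id")"
+    echo "${id} cpus=$(nanocpus_to_cpus "${limits%% *}") memory=$(bytes_to_gib "${limits##* }")GiB"
+  done
+}
+```
+
+Running `show_limits runner-augur` from the directory containing `docker-compose.yml` prints one line per container.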
+
+### tmpfs Sizing
+
+The `/tmp` tmpfs defaults to 2G. If your CI writes large temp files,
+increase it in `docker-compose.yml`:
+
+```yaml
+tmpfs:
+  - /tmp:size=4G
+```
+
+## Monitoring
+
+```bash
+# Container status and health
+docker compose ps
+
+# Live logs
+docker compose logs -f runner-augur
+
+# Last 50 log lines
+docker compose logs --tail 50 runner-augur
+
+# Resource usage (Compose prefixes container names, so resolve the IDs first)
+docker stats $(docker compose ps -q runner-augur)
+```
+
+## Updating the Runner Image
+
+To pull the latest GHCR image:
+```bash
+docker compose pull
+docker compose up -d
+```
+
+To rebuild locally:
+```bash
+docker compose build
+docker compose up -d
+```
+
+### Using a Self-Hosted Registry
+
+See the [CI Workflow Guide](../../docs/ci-workflows.md#lifecycle-2-offload-ci-to-a-server-unraid)
+for the full build-push-start workflow with a self-hosted registry.
+
+## Troubleshooting
+
+### Image build workflow fails with zero steps
+
+The `build-runner-image.yml` workflow needs `packages: write` permission.
+If the repo's default workflow permissions are read-only, the job fails
+instantly (0 steps, no runner assigned). See [One-Time GitHub Setup](#one-time-github-setup).
+
+### `docker compose pull` returns "access denied" or 403
+
+GHCR container packages are private by default when first published, so pulls
+require authentication unless the package has been made public. Authenticate Docker first:
+
+```bash
+echo "$GITHUB_PAT" | docker login ghcr.io -u USERNAME --password-stdin
+```
+
+Or make the package public: open the package on GitHub, go to **Package settings**,
+and change the visibility under **Danger Zone**.
+
+Or skip GHCR entirely and build locally: `docker compose build`.
+
+### Runner doesn't appear in GitHub
+
+1. Check logs: `docker compose logs runner-augur`
+2. Verify `GITHUB_PAT` has `repo` scope
+3. Verify `REPO_URL` is correct (full HTTPS URL)
+4. Check network: `docker compose exec runner-augur curl -s https://api.github.com`
+
+### Runner appears "offline"
+
+The runner may have exited after a job. Check:
+```bash
+docker compose ps          # Is the container running?
+docker compose restart runner-augur  # Force restart
+```
+
+### OOM (Out of Memory) kills
+
+Increase `RUNNER_MEMORY` in the per-repo env file:
+```env
+RUNNER_MEMORY=16G
+```
+
+Then: `docker compose up -d`
+
+### Stale/ghost runners in GitHub
+
+Ephemeral runners deregister automatically after each job. If a container
+was killed ungracefully (power loss, `docker kill`), the runner may linger as
+offline. GitHub removes an ephemeral runner automatically once it has been
+disconnected for about a day, or you can remove it manually:
+
+```bash
+# List runners
+gh api repos/OWNER/REPO/actions/runners --jq '.runners[] | {id, name, status}'
+
+# Remove stale runner by ID
+gh api -X DELETE repos/OWNER/REPO/actions/runners/RUNNER_ID
+```
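+
+When several stale entries have accumulated, the two commands above can be combined into a loop. This is a sketch with a hypothetical function name. It deletes every runner GitHub reports as `offline`, so it should only be run when no healthy runner is expected to be offline.
+
+```bash
+# Hypothetical helper: delete every runner reported as offline.
+# Usage: remove_offline_runners OWNER/REPO
+remove_offline_runners() {
+  local repo="$1" id
+  gh api "repos/${repo}/actions/runners" --paginate \
+    --jq '.runners[] | select(.status == "offline") | .id' |
+  while read -r id; do
+    echo "Removing offline runner ${id}"
+    gh api -X DELETE "repos/${repo}/actions/runners/${id}"
+  done
+}
+```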
+
+### Disk space
+
+Check work directory volume usage:
+```bash
+docker system df -v
+```
+
+Clean up unused volumes:
+```bash
+docker compose down -v   # Stop runners and remove their work volumes
+docker volume prune      # Remove unused anonymous volumes (add --all to include named ones)
+```
+
+## Unraid Notes
+
+- **Docker login persistence**: `docker login ghcr.io` writes credentials to
+  `/root/.docker/config.json`. Unraid normally runs its root filesystem from RAM
+  (the USB flash drive is mounted at `/boot`), so this file may not survive a
+  reboot. After rebooting, check `cat /root/.docker/config.json` and log in
+  again if it is missing.
+- **Compose file location**: Place the 3 files (`docker-compose.yml`, `.env`,
+  `envs/augur.env`) in a share directory (e.g., `/mnt/user/appdata/augur-runner/`).
+- **Alternative to GHCR**: If you don't want to deal with registry auth on Unraid,
+  copy the `Dockerfile` and `entrypoint.sh` alongside the compose file and use
+  `docker compose up -d --build` instead. No registry needed.
+
+## Security
+
+| Measure | Description |
+|---------|-------------|
+| Ephemeral mode | Fresh runner state per job — no cross-job contamination |
+| PAT scope isolation | The PAT is used only to request a short-lived registration token; it is never passed to the runner agent |
+| Non-root user | Runner process runs as UID 1000, not root |
+| no-new-privileges | Prevents privilege escalation via setuid/setgid binaries |
+| tini (PID 1) | Proper signal forwarding and zombie process reaping |
+| Log rotation | Prevents disk exhaustion from verbose CI output (50MB x 3 files) |
+
+### PAT Scope
+
+Use the minimum scope required:
+- **Classic token**: `repo` + `read:packages` scopes
+- **Fine-grained token**: Repository access → Only select repositories → Read and Write for Administration
+
+### Network Considerations
+
+The runner container needs outbound access to:
+- `github.com` (clone repos, download actions)
+- `api.github.com` (registration, status)
+- `ghcr.io` (pull runner image — only if using GHCR)
+- Package registries (`proxy.golang.org`, `registry.npmjs.org`, etc.)
+
+No inbound ports are required.
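+
+The outbound requirements listed above can be checked from the server, or from inside the container with `docker compose exec`, using a loop like the following. The function name is hypothetical. Any HTTP response counts as reachable, so this tests connectivity only, not authentication.
+
+```bash
+# Hypothetical helper: report which hosts answer over HTTPS within 5 seconds.
+# Returns non-zero if any host could not be reached.
+check_outbound() {
+  local host failed=0
+  for host in "$@"; do
+    if curl -sS -o /dev/null --max-time 5 "https://${host}"; then
+      echo "ok    ${host}"
+    else
+      echo "FAIL  ${host}"
+      failed=1
+    fi
+  done
+  return "$failed"
+}
+```
+
+For example: `check_outbound github.com api.github.com ghcr.io proxy.golang.org registry.npmjs.org`.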
+
+## Stopping and Removing
+
+```bash
+# Stop runners (waits for stop_grace_period)
+docker compose down
+
+# Stop and remove work volumes
+docker compose down -v
+
+# Stop, remove volumes, and delete the locally built image
+docker compose down -v --rmi local
+```