# TODO — Phase 7.5 Nginx -> Caddy Consolidation
## Why this exists
This file captures the decisions and migration context for the one-time "phase 7.5"
work so we do not lose reasoning between sessions.
## What happened so far
1. The original `phase8_cutover.sh` was designed for one wildcard zone
   (`*.${CADDY_DOMAIN}`), mainly for Gitea cutover.
2. The homelab currently has two active DNS zones in scope:
   - `sintheus.com` (legacy services behind Nginx)
   - `privacyindesign.com` (new Gitea public endpoint)
3. Decision made: run a one-time migration where a single Caddy instance serves
   both zones, then gradually retire Nginx.
4. Implemented: `phase7_5_nginx_to_caddy.sh` to generate/deploy a multi-domain
   Caddyfile and run canary/full rollout modes.
## Current design decisions
1. Public ingress should be HTTPS-only for all migrated hostnames.
2. Backend scheme is mixed for now:
   - Keep an `http://` upstream where the service does not yet have TLS.
   - Keep `https://` where it is already available.
3. End-to-end HTTPS is a target state, not an immediate requirement.
4. A strict toggle exists in phase 7.5:
   - `--strict-backend-https` fails if any upstream is `http://`.
5. Canary-first rollout:
   - first migration target is `tower.sintheus.com`.
6. Canary mode is additive:
   - preserves existing Caddy routes
   - updates only a managed canary block for `tower.sintheus.com`.
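
The strict toggle described above can be pictured as a simple scan over the host map. The following is a minimal sketch, not the actual logic in `phase7_5_nginx_to_caddy.sh`: the function name `check_strict_backend_https` and the `host|upstream` argument format are assumptions made for illustration.

```shell
# Hypothetical sketch of a strict-backend check (names and input format
# are assumptions, not taken from phase7_5_nginx_to_caddy.sh).
# Each argument is "hostname|upstream".
# Returns 1 if any upstream uses plain HTTP, 0 otherwise.
check_strict_backend_https() {
  local entry host upstream failed=0
  for entry in "$@"; do
    host="${entry%%|*}"      # text before the first "|"
    upstream="${entry#*|}"   # text after the first "|"
    case "$upstream" in
      http://*)
        echo "strict: ${host} -> ${upstream} is not HTTPS" >&2
        failed=1
        ;;
    esac
  done
  return "$failed"
}
```

Called with `"tower.sintheus.com|https://192.168.1.82:443"` and `"ai.sintheus.com|http://192.168.1.82:8181"`, this sketch reports the `ai` host on stderr and returns non-zero, which is the behavior a strict rollout gate would need.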
## Host map and backend TLS status
### Canary scope (default mode)
- `tower.sintheus.com -> https://192.168.1.82:443` (TLS backend; cert verify skipped)
- `${GITEA_DOMAIN} -> http://${UNRAID_GITEA_IP}:3000` (HTTP backend for now)
### Full migration scope
- `ai.sintheus.com -> http://192.168.1.82:8181`
- `photos.sintheus.com -> http://192.168.1.222:2283`
- `fin.sintheus.com -> http://192.168.1.233:8096`
- `disk.sintheus.com -> http://192.168.1.52:80`
- `pi.sintheus.com -> http://192.168.1.4:80`
- `plex.sintheus.com -> http://192.168.1.111:32400`
- `sync.sintheus.com -> http://192.168.1.119:8384`
- `syno.sintheus.com -> https://100.108.182.16:5001` (verify skipped)
- `tower.sintheus.com -> https://192.168.1.82:443` (verify skipped)
- `${GITEA_DOMAIN} -> http://${UNRAID_GITEA_IP}:3000`
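
To make the mapping concrete, here is a rough sketch of how one entry of the host map could be turned into a Caddyfile site block. The function name `emit_site_block` and its `skip-verify` argument are hypothetical; the real generator in `phase7_5_nginx_to_caddy.sh` may produce different output. The `reverse_proxy` directive and the `tls_insecure_skip_verify` transport option are standard Caddy v2 syntax.

```shell
# Hypothetical sketch: print one Caddyfile site block.
#   $1 = public hostname, $2 = upstream URL,
#   $3 = "skip-verify" to disable upstream certificate verification
#        (assumed convention for the "verify skipped" entries above).
emit_site_block() {
  local host="$1" upstream="$2" verify="${3:-verify}"
  printf '%s {\n' "$host"
  if [ "$verify" = "skip-verify" ]; then
    printf '\treverse_proxy %s {\n' "$upstream"
    printf '\t\ttransport http {\n'
    printf '\t\t\ttls_insecure_skip_verify\n'
    printf '\t\t}\n'
    printf '\t}\n'
  else
    printf '\treverse_proxy %s\n' "$upstream"
  fi
  printf '}\n'
}
```

For example, `emit_site_block tower.sintheus.com https://192.168.1.82:443 skip-verify` prints a block that proxies to the TLS backend without verifying its certificate, while `emit_site_block ai.sintheus.com http://192.168.1.82:8181` prints a plain three-line block.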
## Definition of done (phase 7.5)
Phase 7.5 is done only when all of the following are true:
1. Caddy is running on Unraid with generated multi-domain config.
2. Canary host `tower.sintheus.com` is reachable over HTTPS through Caddy.
3. Canary routing is proven by at least one of these methods:
   - `curl --resolve` tests, or
   - split-DNS/hosts override, or
   - intentional DNS cutover.
4. Legacy Nginx remains available for non-migrated hosts during canary.
5. No critical regressions observed for at least 24 hours on canary traffic.
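
A `curl --resolve` test pins a hostname to a chosen IP for a single request, so the canary can be exercised without touching DNS. The sketch below assumes Caddy listens on port 443 at an address held in `CADDY_IP`; that variable, its `192.0.2.10` placeholder value, and the function names are illustrative assumptions, not values from this project.

```shell
# Placeholder address (assumption): set CADDY_IP to where Caddy listens.
CADDY_IP="${CADDY_IP:-192.0.2.10}"

# Build the value for curl's --resolve option: HOST:PORT:ADDRESS.
resolve_arg() {
  printf '%s:443:%s' "$1" "$2"
}

# Request https://HOST/ with HOST pinned to the given IP and print only
# the HTTP status code. curl still validates the certificate that Caddy
# presents for HOST.
check_canary() {
  local host="$1" ip="$2"
  curl --silent --show-error --output /dev/null \
    --write-out '%{http_code}\n' \
    --resolve "$(resolve_arg "$host" "$ip")" \
    "https://${host}/"
}
```

Running `check_canary tower.sintheus.com "$CADDY_IP"` should print a status code if the request reaches Caddy and the backend responds. A TLS error at this stage points at Caddy's certificate for the hostname rather than at the backend.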
## Definition of done (final state after full migration)
1. All selected domains route to Caddy through the intended ingress path:
   - LAN-only: split-DNS/private resolution to Caddy, or
   - public: DNS to WAN ingress that forwards 443 to Caddy.
2. Caddy serves valid certificates for both zones.
3. Functional checks pass for each service (UI load, API, websocket/streaming where relevant).
4. Nginx is no longer on the request path for migrated domains.
5. Long-term target: all backends upgraded to `https://` and strict mode passes.
## What remains to happen
1. Run canary:
   - `./phase7_5_nginx_to_caddy.sh --mode=canary`
2. Route canary traffic to Caddy using one method:
   - `curl --resolve` for zero-DNS-change testing, or
   - split-DNS/private DNS, or
   - explicit DNS cutover if desired.
3. Observe errors/latency/app behavior for at least 24 hours.
4. If canary is clean, run full:
   - `./phase7_5_nginx_to_caddy.sh --mode=full`
5. Move remaining routes in batches (DNS or split-DNS, depending on ingress model).
6. Validate each app after each batch.
7. After everything is stable, plan Nginx retirement.
8. Later hardening pass:
   - enable TLS on each backend service one by one
   - flip each corresponding upstream to `https://`
   - finally run `--strict-backend-https` and require it to pass.
## Risks and why mixed backend HTTP is acceptable short-term
1. Risk: backend HTTP is unencrypted on the LAN.
   - Mitigation: traffic stays on the trusted local network, and this is a temporary state only.
2. Risk: if strict mode is enabled too early, rollout blocks.
   - Mitigation: keep strict mode off until backend TLS coverage improves.
3. Risk: moving all DNS at once can create a broad outage.
   - Mitigation: canary-first and batch DNS cutover.
## Operational notes
1. If a Caddyfile already exists, phase 7.5 backs it up as:
   - `${CADDY_DATA_PATH}/Caddyfile.pre_phase7_5.<timestamp>`
2. Compose stack path for Caddy:
   - `${UNRAID_COMPOSE_DIR}/caddy/docker-compose.yml`
3. The script does not change Cloudflare DNS records automatically.
   - DNS updates are intentional/manual to keep blast radius controlled.
4. Do not set public Cloudflare proxied records to private `192.168.x.x` addresses.
5. Canary upsert behavior is domain-aware:
   - if a site block for the canary domain does not exist, it is added
   - if the site block exists, it is replaced in-place
   - previous block content is printed in logs before replacement
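
The backup and upsert behavior described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `phase7_5_nginx_to_caddy.sh`: the function names and the timestamp format are hypothetical, and the brace counting only handles site blocks that open with `domain {` on a line of their own with a single site address.

```shell
# Hypothetical sketch: copy an existing Caddyfile to a timestamped backup.
# The timestamp format is an assumption. Does nothing if the file is absent.
backup_caddyfile() {
  local caddyfile="$1"
  [ -f "$caddyfile" ] || return 0
  cp -p "$caddyfile" "${caddyfile}.pre_phase7_5.$(date +%Y%m%d_%H%M%S)"
}

# Hypothetical sketch: write an updated Caddyfile to stdout.
#   $1 = domain, $2 = current Caddyfile, $3 = file holding the new block.
# If a site block for the domain exists it is replaced in place and the
# old lines are logged to stderr; otherwise the new block is appended.
upsert_site_block() {
  local domain="$1" caddyfile="$2" block_file="$3"
  awk -v domain="$domain" -v block_file="$block_file" '
    function emit_block(   line) {
      while ((getline line < block_file) > 0) print line
      close(block_file)
    }
    # Opening line of the target block, e.g. "tower.sintheus.com {".
    !inside && $1 == domain && $NF == "{" {
      inside = 1; depth = 0; replaced = 1
      emit_block()
    }
    # Skip old block lines, tracking brace depth to find the block end.
    inside {
      print ("previous: " $0) > "/dev/stderr"
      depth += gsub(/[{]/, "{") - gsub(/[}]/, "}")
      if (depth == 0) inside = 0
      next
    }
    { print }
    END { if (!replaced) emit_block() }
  ' "$caddyfile"
}
```

A cautious caller would run `backup_caddyfile` first, redirect the output of `upsert_site_block` to a temporary file, and only move it over the live Caddyfile after checking the result.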