diff --git a/.gitea/README.md b/.gitea/README.md new file mode 100644 index 0000000..8838ab7 --- /dev/null +++ b/.gitea/README.md @@ -0,0 +1,291 @@ +# CI / Build Pipeline + +This directory contains the gitea Actions workflows and the supporting +documentation for opencode-devbox's CI. If you're investigating *why* +the build pipeline is shaped the way it is, you're in the right place. + +## Workflows in this directory + +| File | Trigger | Role | +|---|---|---| +| [`workflows/docker-publish.yml`](workflows/docker-publish.yml) | `push: tags: v*` | **Production release pipeline.** Multi-arch build of all four variants (`base`, `omos`, `with-pi`, `omos-with-pi`), publish to Docker Hub, update Hub description. ~165–180 min wall clock. | +| [`workflows/docker-publish-split.yml`](workflows/docker-publish-split.yml) | `workflow_dispatch` (manual) | **Experimental split-base pipeline.** Two-phase build: shared `base-` published once, then four thin variant deltas. Estimated ~30–40 min on cache hit, ~70–90 min when base needs rebuilding. Not yet validated end-to-end; once 1–2 dispatch test runs prove it, this will take over `on: push: tags: v*` and `docker-publish.yml` will be retired. | +| [`workflows/validate.yml`](workflows/validate.yml) | `push: branches: main` + PR | **Lightweight gate.** amd64-only smoke test of all four variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. | + +## Why two release pipelines exist + +opencode-devbox publishes **four image variants** (`base`, `omos`, `with-pi`, `omos-with-pi`) × **two architectures** (amd64, arm64) = **eight image tags per release**. Today's runners are 2 self-hosted gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native). + +The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). 
The original `Dockerfile` was a single multi-stage build with `INSTALL_*` build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated `RUN` produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build. + +Two improvements were considered: + +1. **Reorder the original Dockerfile** so all variant-gated RUNs land at the bottom — modest gain, ~10–20% wall-clock reduction. *Not pursued.* +2. **Split into `Dockerfile.base` + `Dockerfile.variant`** with the base published as a long-lived shared image — significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. *Pursued.* + +The split-base architecture is what the `docker-publish-split.yml` workflow exercises. + +## How the split-base pipeline works + +``` + ┌──────────────────┐ + │ base-decide │ compute base-; + │ │ probe Docker Hub. + │ hash inputs: │ + │ Dockerfile.base│ + │ rootfs/ │ + │ entrypoint*.sh │ + └────────┬─────────┘ + │ + ┌─────────────┴─────────────┐ + │ need_build = true? │ + └─────────────┬─────────────┘ + yes │ no + ▼ + ┌──────────────────┐ + │ build-base │ multi-arch build, + │ │ push base- + └────────┬─────────┘ to Docker Hub. + │ + ┌───────────────────────┼───────────────────────┐ + ▼ ▼ ▼ + ┌──────────┐ ┌──────────┐ ┌──────────────┐ + │smoke-base│ │smoke-omos│ ... │smoke-omos-pi │ amd64 only, + └────┬─────┘ └────┬─────┘ └──────┬───────┘ parallel. + │ │ │ + ▼ ▼ ▼ + ┌──────────┐ ┌──────────┐ ┌──────────────┐ + │build- │ │build- │ │build- │ multi-arch, + │variant- │ │variant- │ ... │variant- │ parallel, + │base │ │omos │ │omos-with-pi │ tag push. 
+ └────┬─────┘ └────┬─────┘ └──────┬───────┘ + └───────────────────────┴──────────────────────┘ + │ + ▼ + ┌──────────────────────────┐ + │ promote-base-latest │ crane copy + │ │ base- + │ │ → base-latest + └────────┬─────────────────┘ + │ + ▼ + ┌──────────────────────────┐ + │ update-description │ + └──────────────────────────┘ +``` + +### Step 1: `base-decide` + +Compute a SHA-256 hash over the inputs that determine the base image's +content: + +```sh +{ + cat Dockerfile.base + find rootfs -type f -print0 | sort -z | xargs -0 cat + cat entrypoint.sh entrypoint-user.sh +} | sha256sum | cut -c1-12 +``` + +The 12-character truncated hash becomes `base-`. Probe Docker Hub +for this tag via `docker manifest inspect`: + +- If it exists → set `need_build=false`. `build-base` is skipped entirely. +- If it doesn't → set `need_build=true`. `build-base` runs. + +This is the core cache-reuse mechanism. Version-bump-only releases +(only `Dockerfile.variant` or build-args changed) hit the cache. Releases +that change anything in the base — apt packages, AWS CLI, Node version, +locale list, entrypoint scripts — pay the full base-build cost once. + +### Step 2: `build-base` (conditional) + +Only runs when `need_build=true`. Multi-arch (amd64 + arm64) build of +`Dockerfile.base`, pushed to `joakimp/opencode-devbox:base-`. +Registry cache via `--cache-from/--cache-to` reduces incremental rebuilds +when only one or two layers changed. + +The base image is **not** tagged `base-latest` here — that promotion +happens at the very end after all variants succeed (see step 5). + +### Step 3: `smoke-*` (×4, parallel) + +For each variant: build amd64-only against the base tag, load into +local docker, run [`scripts/smoke-test.sh`](../scripts/smoke-test.sh). 
+Variant build-args: + +| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI | +|---|---|---|---| +| `base` | true | false | false | +| `omos` | true | true | false | +| `with-pi` | true | false | true | +| `omos-with-pi` | true | true | true | + +Smoke runs `--variant ` to enable variant-specific assertions. +Gate the publish: a smoke failure for variant X blocks `build-variant-X`. + +### Step 4: `build-variant-*` (×4, parallel) + +For each variant that passed smoke: multi-arch (amd64 + arm64) build of +`Dockerfile.variant`, pushed to Docker Hub with the user-facing release +tags: + +| Build job | Tags pushed | +|---|---| +| `build-variant-base` | `vX.Y.Z`, `latest` | +| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` | +| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` | +| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` | + +The `latest*` aliases are only updated when `promote_latest=true` (the +manual dispatch input) — for test runs, `promote_latest=false` keeps the +production aliases pointing at the previous good release. + +### Step 5: `promote-base-latest` + +Once all four variants successfully publish, re-tag `base-` as +`base-latest` using `crane copy`. This is a **manifest-level re-tag, not +a rebuild** — it touches only Docker Hub's image index, takes seconds, +and is atomic. + +The reason this happens *after* variants succeed (rather than alongside +`build-base`) is so a partial failure leaves `base-latest` pointing at +the previous known-good base. External consumers who pin to +`base-latest` (e.g. the planned pi-devbox repo) never see a broken base. + +### Step 6: `update-description` + +Push the generated `DOCKER_HUB.md` to the Hub repo's `full_description` +field via the Hub REST API. Same step as the production pipeline. 
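
A minimal sketch of the Step 1 decision, assuming it runs from the repository root with GNU coreutils and findutils available. The hash pipeline is the one shown in Step 1. The function names, the `IMAGE` default, and the `PROBE` override are illustrative and not taken from the workflow.

```sh
#!/bin/sh
# Sketch of base-decide: hash the base inputs, then probe the registry.
# Names below (compute_base_hash, decide_need_build, PROBE) are hypothetical.
set -eu

IMAGE="${IMAGE:-joakimp/opencode-devbox}"

# Hash the inputs that determine the base image, as in Step 1.
# Must be run from the directory holding Dockerfile.base and rootfs/.
compute_base_hash() {
  {
    cat Dockerfile.base
    find rootfs -type f -print0 | sort -z | xargs -0 cat
    cat entrypoint.sh entrypoint-user.sh
  } | sha256sum | cut -c1-12
}

# Print need_build=false when the tag already exists on the registry,
# need_build=true otherwise. PROBE defaults to `docker manifest inspect`
# and can be overridden to exercise the logic without network access.
decide_need_build() {
  tag="$1"
  if ${PROBE:-docker manifest inspect} "$IMAGE:$tag" >/dev/null 2>&1; then
    echo "need_build=false"
  else
    echo "need_build=true"
  fi
}
```

In a workflow step the result could be appended to the job outputs, for example `decide_need_build "base-$(compute_base_hash)" >> "$GITHUB_OUTPUT"`, assuming the tag is the `base-` prefix followed by the hash and that the runner provides `GITHUB_OUTPUT`.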
+ +## NPM_CONFIG_PREFIX gotcha (variant override pattern) + +The base sets + +``` +ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global +``` + +This is intentional — it makes `pi install npm:` and `npm install -g` +land on the `devbox-pi-config` named volume at runtime, so user-installed +packages survive container recreate AND image rebuild. + +But the *variant build* inherits this prefix at build time. If left as-is, +`npm install -g opencode-ai@$VERSION` in `Dockerfile.variant` would +install opencode into `/home/developer/.pi/npm-global/...`, which is then +**shadowed by the volume mount at runtime** → opencode disappears from +PATH on first start. + +Fix: each `npm install -g` in `Dockerfile.variant` overrides the prefix +per-RUN: + +```dockerfile +RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION} +``` + +Baked binaries land on `/usr/bin/...` (system prefix), survive the volume +mount. Runtime-installed user packages still land on +`~/.pi/npm-global/...`. Both visible on PATH. + +## Cache strategy + +Two registry caches are configured: + +```yaml +cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache +cache-to: type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max + +cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache +cache-to: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max +``` + +`mode=max` exports cache for *all* layers, not just the final image's +layers. Important for multi-arch builds where the cross-arch layer reuse +matters more. 
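
For reproducing a build outside the workflow, the same cache pair can be passed to `docker buildx build` as flags. The sketch below only assembles the flags. The helper name `cache_flags` and the `IMAGE` default are illustrative, and the invocation in the trailing comment is an assumption about how the base build might be run by hand, not a copy of the workflow step.

```sh
#!/bin/sh
# Sketch: CLI equivalent of the cache-from/cache-to pairs above.
# cache_flags is a hypothetical helper, not part of the repository.
set -eu

IMAGE="${IMAGE:-joakimp/opencode-devbox}"

# Print the registry-cache flags for one cache tag, one flag per line.
cache_flags() {
  ref="$IMAGE:$1"
  printf '%s\n' \
    "--cache-from=type=registry,ref=$ref" \
    "--cache-to=type=registry,ref=$ref,mode=max"
}

# Example invocation (not executed here; BASE_TAG is a placeholder):
#   docker buildx build \
#     --platform linux/amd64,linux/arm64 \
#     $(cache_flags base-buildcache) \
#     -f Dockerfile.base -t "$IMAGE:$BASE_TAG" --push .
```

Keeping the base and variant caches under separate tags, as the configuration above does, means a variant build exporting its cache does not overwrite the base cache.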
+ +## Wall-clock estimates + +| Scenario | Production pipeline | Split-base pipeline | +|---|---|---| +| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | **~30–40 min** (base cache hit) | +| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | **~70–90 min** (base rebuilds) | + +The split-base pipeline pays its dues on base-touching releases (which are +infrequent, a few times a year for Debian / Node major version bumps). +Most releases are version bumps and ride the cache. + +## Validate workflow + +[`validate.yml`](workflows/validate.yml) is the lightweight gate that runs +on every push to `main` and on PRs. It: + +1. Runs `scripts/generate-dockerhub-md.py --check` to enforce that + `DOCKER_HUB.md` is in sync with `HUB_TEMPLATE`. +2. Builds each of the four variants amd64-only (no multi-arch, no push) + and runs `scripts/smoke-test.sh`. + +This catches regressions before they reach a tag push. Wall clock ~30 min. + +## Runner expectations + +- **Image:** `catthehacker/ubuntu:act-latest`. Each job runs inside a + fresh container of this image. Don't assume any pre-installed + toolchains beyond what catthehacker ships. +- **Disk pressure:** the runner host has ~40 GB of usable overlay space, + often 70%+ used at job start. Every job that does `load: true` (smoke) + starts with a `Reclaim runner disk` step that strips + catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, + Boost, Chromium, PowerShell) and prunes stale docker state. Don't + remove these steps without testing on a fresh runner. +- **Concurrency:** 2 runners. Jobs in the same workflow run can fan out to + both; jobs in *different* workflow runs are serialized by gitea's queue. + The `concurrency: { group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: false }` + setting keeps tag pushes from racing each other but allows + per-PR/per-branch parallelism. 
+- **Workflow visibility in UI:** gitea Actions only surfaces workflows + from the **default branch** in the web UI's workflow list, even for + `workflow_dispatch` triggers. Workflows on feature branches are + invisible until merged to `main`. +- **Artifact actions quirk:** `actions/{upload,download}-artifact@v4+` does + not work on Gitea (it depends on a GitHub-only Artifact API). Stick to + `@v3` if matrix-fanout-with-artifacts is ever needed. We avoided this + by using `docker/build-push-action@v7` with comma-separated + `platforms: linux/amd64,linux/arm64`, which does a multi-arch push + natively in a single job, with no artifact dance. + +## Migration plan: split-base → production + +1. **Validate the split-base dispatch.** Trigger + `docker-publish-split.yml` manually with `release_tag=v0.0.0-split-test` + and `promote_latest=false`. Confirm all jobs go green, image sizes + match the production baseline within ~10%, and no unexpected layer + rebuilds appear in `build-variant-*` logs after the FROM line. +2. **Run a second dispatch** to confirm cache-hit behavior: + `base-decide` should set `need_build=false`, `build-base` should be + skipped entirely, total wall clock should drop to ~25–40 min. +3. **Cut over.** In a single commit: + - Edit `docker-publish-split.yml`: change `on: workflow_dispatch:` to + `on: push: tags: v*`, wire `$GITHUB_REF` into the `release_tag` + input, and set `promote_latest=true` for production runs. + - Delete `docker-publish.yml`. + - Delete the original `Dockerfile` (keep `Dockerfile.base` + + `Dockerfile.variant`). + - Update `CHANGELOG.md`: promote the "Build pipeline" Unreleased entry. +4. **Tag a release.** First production release on the new pipeline. Watch + it like a hawk for the first run. + +## Related docs + +- [`AGENTS.md`](../AGENTS.md) — domain facts, release-day checklist, + documentation coupling rules. Read first when modifying CI behavior. 
+- [`CHANGELOG.md`](../CHANGELOG.md) — the build pipeline rewrite is + recorded under `Unreleased` until the cutover lands. +- `Dockerfile`, `Dockerfile.base`, `Dockerfile.variant` — production + single-Dockerfile build and the split-base counterparts. Comments at + the top of each explain its role. +- [`scripts/smoke-test.sh`](../scripts/smoke-test.sh) — invoked by all + three workflows; this is the single source of truth for "what does a + built image have to satisfy". +- [`scripts/generate-dockerhub-md.py`](../scripts/generate-dockerhub-md.py) + — generates `DOCKER_HUB.md` from `HUB_TEMPLATE`. `--check` enforces + sync in `validate.yml`. diff --git a/AGENTS.md b/AGENTS.md index 7542050..587783a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -15,6 +15,7 @@ Docker image packaging [opencode](https://opencode.ai) into a production-ready d - `scripts/generate-dockerhub-md.py` — generates `DOCKER_HUB.md` from a hand-maintained `HUB_TEMPLATE` constant. `--check` fails if the committed file is out of sync (enforced by the `validate` workflow). - `DOCKER_HUB.md` — **auto-generated** from `HUB_TEMPLATE` in `scripts/generate-dockerhub-md.py`. Do not edit directly. Pushed to Docker Hub description via CI API call. Must stay under 25 kB. Short description field must be ≤100 bytes. - `README.md` — authoritative source documentation for everything in this repo. Independent of `DOCKER_HUB.md`: the Hub doc is hand-maintained in the generator's `HUB_TEMPLATE` and intentionally slim, linking back to the gitea README for depth. +- `.gitea/README.md` — **read this first** if you're touching CI. Architectural overview of the build pipeline (production vs split-base), wall-clock estimates, NPM_CONFIG_PREFIX gotcha, runner expectations, migration plan. - `.gitea/workflows/validate.yml` — lightweight amd64 build + smoke test on push to main and PRs. Also runs the DOCKER_HUB.md sync check. 
- `.gitea/workflows/docker-publish.yml` — production CI pipeline on tag push: smoke-test each variant on amd64, then full multi-arch (amd64 + arm64) build-and-push, then update Docker Hub description. - `.gitea/workflows/docker-publish-split.yml` — **WIP, branch `feat/split-build` only.** Two-phase split-base pipeline. Triggers on `workflow_dispatch` only so it runs alongside the production pipeline without conflict. Pushes to user-supplied `release_tag` input (e.g. `v0.0.0-split-test`); `latest*` aliases only updated when `promote_latest: true`. Compute base hash, conditionally build base, then 4 variant deltas in parallel. diff --git a/CHANGELOG.md b/CHANGELOG.md index e33cef0..c3671eb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,10 @@ Tags follow `v{opencode_version}[letter]` — bare tag for the first build on a ## Unreleased +Docs: + +- **New: `.gitea/README.md`** — architectural overview of the build pipeline. Documents the production single-Dockerfile path vs the merged-but-unvalidated split-base path, hash-driven base cache reuse, wall-clock estimates, the `NPM_CONFIG_PREFIX` variant-override pattern, runner expectations (catthehacker container, disk reclaim, concurrency, gitea-actions @v4 artifact gotcha), and the cutover plan. Auto-renders when navigating to `.gitea/` in the gitea web UI. Linked from `AGENTS.md` as the first thing to read when touching CI. + Build pipeline (merged to main as `Dockerfile.base` + `Dockerfile.variant` + `.gitea/workflows/docker-publish-split.yml`, NOT yet validated end-to-end — the `workflow_dispatch` test against `:base-` + `:v0.0.0-split-test*` aliases is still the gating step before this can take over `on: push: tags: v*`): - **New: split-base build pipeline.** `Dockerfile.base` (variant-independent layers — apt, locales, AWS CLI, Node.js, mempalace, gitea-mcp, user setup, chromadb prewarm, ENVs, entrypoints) builds once and is published as `joakimp/opencode-devbox:base-`. 
`Dockerfile.variant` `FROM`s that base and adds only opencode/omos/pi installs (or skips them per build-args). Companion workflow `.gitea/workflows/docker-publish-split.yml` runs as a `workflow_dispatch`-only pipeline alongside the existing `docker-publish.yml` so they don't conflict. Hash-driven base reuse: a content hash of `Dockerfile.base + rootfs/ + entrypoint*.sh` becomes the base tag; if the tag already exists on Docker Hub, the base build is skipped entirely. Estimated wall clock: version-bump-only release ~30–40 min (vs ~165–180 min today); base-touching release ~60–70 min. Trade-off: two Dockerfiles to maintain, and `npm install -g` in the variant must override `NPM_CONFIG_PREFIX=/usr` per-RUN to keep baked binaries off the volume-shadowed path. Once 1–2 successful workflow_dispatch runs validate the output against the existing pipeline, the new workflow takes over `on: push: tags: v*` and the original is retired.