# CI / Build Pipeline

This directory contains the gitea Actions workflows and the supporting documentation for opencode-devbox's CI. If you're investigating *why* the build pipeline is shaped the way it is, you're in the right place.

## Workflows in this directory

| File | Trigger | Role |
|---|---|---|
| [`workflows/docker-publish-split.yml`](workflows/docker-publish-split.yml) | `push: tags: v*` | **Production release pipeline.** Two-phase split-base build: shared `base-<hash>` published once (skipped on cache hit), then four parallel variant deltas. ~40–80 min wall clock depending on runner count and whether base needs rebuilding. |
| [`workflows/validate.yml`](workflows/validate.yml) | `push: branches: main` + PR | **Lightweight gate.** amd64-only smoke test of all four variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. |

## Why the split-base pipeline exists

opencode-devbox publishes **four image variants** (`base`, `omos`, `with-pi`, `omos-with-pi`) × **two architectures** (amd64, arm64) = **eight image tags per release**. CI currently runs on two self-hosted gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original `Dockerfile` was a single multi-stage build with `INSTALL_*` build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated `RUN` produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.

Two improvements were considered:

1. **Reorder the original Dockerfile** so all variant-gated RUNs land at the bottom — modest gain, ~10–20% wall-clock reduction. *Not pursued.*
2. **Split into `Dockerfile.base` + `Dockerfile.variant`** with the base published as a long-lived shared image — significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. *Pursued.*

The split-base architecture is what the `docker-publish-split.yml` workflow exercises.

## How the split-base pipeline works

```
                    ┌──────────────────┐
                    │   base-decide    │   compute base-<hash>;
                    │                  │   probe Docker Hub.
                    │  hash inputs:    │
                    │   Dockerfile.base│
                    │   rootfs/        │
                    │   entrypoint*.sh │
                    └────────┬─────────┘
                             │
               ┌─────────────┴─────────────┐
               │    need_build = true?     │
               └─────────────┬─────────────┘
                         yes │ no
                             ▼
                    ┌──────────────────┐
                    │   build-base     │   multi-arch build,
                    │                  │   push base-<hash>
                    └────────┬─────────┘   to Docker Hub.
                             │
     ┌───────────────────────┼───────────────────────┐
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│smoke-base│            │smoke-omos│   ...    │smoke-omos-pi │   amd64 only,
└────┬─────┘            └────┬─────┘          └──────┬───────┘   parallel.
     │                       │                       │
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│build-    │            │build-    │          │build-        │   multi-arch,
│variant-  │            │variant-  │   ...    │variant-      │   parallel,
│base      │            │omos      │          │omos-with-pi  │   tag push.
└────┬─────┘            └────┬─────┘          └──────┬───────┘
     └───────────────────────┴───────────────────────┘
                             │
                             ▼
                    ┌──────────────────────────┐
                    │   promote-base-latest    │   crane copy
                    │                          │   base-<hash>
                    │                          │   → base-latest
                    └────────┬─────────────────┘
                             │
                             ▼
                    ┌──────────────────────────┐
                    │   update-description     │
                    └──────────────────────────┘
```

### Step 1: `base-decide`

Compute a SHA-256 hash over the inputs that determine the base image's content:

```sh
{
  cat Dockerfile.base
  find rootfs -type f \
    ! -path '*/__pycache__/*' \
    ! -name '*.pyc' \
    ! -name '.DS_Store' \
    ! -name '._*' \
    -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12
```

Junk filters keep the local recompute reproducible against CI's clean checkout — `__pycache__/*.pyc` and macOS metadata files (`.DS_Store`, `._AppleDouble`) are gitignored but still walked by `find -type f`.

The 12-character truncated hash becomes `base-<hash>`.
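
To check the junk-filter behavior locally, the same pipeline can be wrapped in a function and run against any checkout. This is a minimal sketch: `compute_base_tag` is a hypothetical helper name, not something taken from the workflow, and it assumes GNU `find`, `sort`, `xargs`, and `sha256sum` are available.

```sh
#!/bin/sh
# Sketch only: compute_base_tag is a hypothetical helper, not part of the workflow.
# It runs the hash pipeline from Step 1 inside the given repository root
# and prints the resulting tag in the form base-<12 hex chars>.
compute_base_tag() {
    (
        # $1: repository root (defaults to the current directory)
        cd "${1:-.}" || exit 1
        hash=$(
            {
                cat Dockerfile.base
                find rootfs -type f \
                    ! -path '*/__pycache__/*' \
                    ! -name '*.pyc' \
                    ! -name '.DS_Store' \
                    ! -name '._*' \
                    -print0 | sort -z | xargs -0 cat
                cat entrypoint.sh entrypoint-user.sh
            } | sha256sum | cut -c1-12
        )
        printf 'base-%s\n' "$hash"
    )
}

# Example (not run here): compute_base_tag /path/to/checkout
```

Adding a stray `.DS_Store` or `__pycache__/*.pyc` under `rootfs/` should leave the printed tag unchanged, while editing a real file should change it. Note that `sort` orders by the current locale, so a local run only matches CI if both sides use the same collation; whether the workflow pins a locale is not stated here.
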

Probe Docker Hub for this tag via `docker manifest inspect`:

- If it exists → set `need_build=false`. `build-base` is skipped entirely.
- If it doesn't → set `need_build=true`. `build-base` runs.

This is the core cache-reuse mechanism. Version-bump-only releases (only `Dockerfile.variant` or build-args changed) hit the cache. Releases that change anything in the base — apt packages, AWS CLI, Node version, locale list, entrypoint scripts — pay the full base-build cost once.

### Step 2: `build-base` (conditional)

Only runs when `need_build=true`. Multi-arch (amd64 + arm64) build of `Dockerfile.base`, pushed to `joakimp/opencode-devbox:base-<hash>`. Registry cache via `--cache-from/--cache-to` reduces incremental rebuilds when only one or two layers changed.

The base image is **not** tagged `base-latest` here — that promotion happens at the very end after all variants succeed (see step 5).

### Step 3: `smoke-*` (×4, parallel)

For each variant: build amd64-only against the base tag, load into local docker, run [`scripts/smoke-test.sh`](../scripts/smoke-test.sh).

Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
|---|---|---|---|
| `base` | true | false | false |
| `omos` | true | true | false |
| `with-pi` | true | false | true |
| `omos-with-pi` | true | true | true |

The smoke script runs with `--variant <variant>` to enable variant-specific assertions. Smoke jobs gate the publish: a smoke failure for variant X blocks `build-variant-X`.
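
The build-arg table above can be expressed as a small lookup. The sketch below is illustrative only: `variant_build_args` is a hypothetical helper, and how the workflow actually passes these flags (and the base tag) to the build is not shown in this document.

```sh
#!/bin/sh
# Sketch only: variant_build_args is a hypothetical helper, not part of the workflow.
# It maps a variant name to the INSTALL_* build-args from the table above and
# prints one "--build-arg NAME=value" pair per line.
variant_build_args() {
    case "$1" in
        base)         omos=false pi=false ;;
        omos)         omos=true  pi=false ;;
        with-pi)      omos=false pi=true  ;;
        omos-with-pi) omos=true  pi=true  ;;
        *) echo "unknown variant: $1" >&2; return 1 ;;
    esac
    # INSTALL_OPENCODE is true for every variant in the table.
    printf '%s\n' \
        "--build-arg INSTALL_OPENCODE=true" \
        "--build-arg INSTALL_OMOS=$omos" \
        "--build-arg INSTALL_PI=$pi"
}

# Example (not run here), relying on word splitting of the unquoted substitution:
#   docker buildx build --load -f Dockerfile.variant $(variant_build_args omos) .
```

Keeping the mapping in one place means an unknown variant name fails loudly instead of silently building with default build-args.
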
### Step 4: `build-variant-*` (×4, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of `Dockerfile.variant`, pushed to Docker Hub with the user-facing release tags:

| Build job | Tags pushed |
|---|---|
| `build-variant-base` | `vX.Y.Z`, `latest` |
| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` |
| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` |
| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` |

The `latest*` aliases are only updated when `promote_latest=true` (the manual dispatch input) — for test runs, `promote_latest=false` keeps the production aliases pointing at the previous good release.

### Step 5: `promote-base-latest`

Once all four variants successfully publish, re-tag `base-<hash>` as `base-latest` using `crane copy`. This is a **manifest-level re-tag, not a rebuild** — it touches only Docker Hub's image index, takes seconds, and is atomic.

The reason this happens *after* variants succeed (rather than alongside `build-base`) is so a partial failure leaves `base-latest` pointing at the previous known-good base. External consumers who pin to `base-latest` (e.g. the planned pi-devbox repo) never see a broken base.

### Step 6: `update-description`

Push the generated `DOCKER_HUB.md` to the Hub repo's `full_description` field via the Hub REST API. Same step as in the original (pre-split) pipeline.

## NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

```
ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global
```

This is intentional — it makes `pi install npm:<package>` and `npm install -g` land on the `devbox-pi-config` named volume at runtime, so user-installed packages survive container recreate AND image rebuild.

But the *variant build* inherits this prefix at build time.
If left as-is, `npm install -g opencode-ai@$VERSION` in `Dockerfile.variant` would install opencode into `/home/developer/.pi/npm-global/...`, which is then **shadowed by the volume mount at runtime** → opencode disappears from PATH on first start.

Fix: each `npm install -g` in `Dockerfile.variant` overrides the prefix per-RUN:

```dockerfile
RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}
```

Baked binaries land on `/usr/bin/...` (system prefix) and survive the volume mount. Runtime-installed user packages still land on `~/.pi/npm-global/...`. Both are visible on PATH.

## Cache strategy

Two registry caches are configured:

```yaml
cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max
```

`mode=max` exports cache for *all* layers, not just the final image's layers. Important for multi-arch builds where the cross-arch layer reuse matters more.

## Wall-clock estimates

| Scenario | Original pipeline (pre-split) | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | **~30–40 min** (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | **~70–90 min** (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases (which are infrequent — a few times a year for Debian / Node major version bumps). Most releases are version-bumps and ride the cache.

## Validate workflow

[`validate.yml`](workflows/validate.yml) is the lightweight gate that runs on every push to `main` and on PRs. It:

1. Runs `scripts/generate-dockerhub-md.py --check` to enforce `DOCKER_HUB.md` is in sync with `HUB_TEMPLATE`.
2. Builds each of the four variants amd64-only (no multi-arch, no push) and runs `scripts/smoke-test.sh`.

This catches regressions before they reach a tag push. Wall clock ~30 min.

## Runner expectations

- **Image:** `catthehacker/ubuntu:act-latest`. Each job runs inside a fresh container of this image. Don't assume any pre-installed toolchains beyond what catthehacker ships.
- **Disk pressure:** the runner host has ~40 GB of usable overlay space, often 70%+ used at job start. Every job that does `load: true` (smoke) starts with a `Reclaim runner disk` step that strips catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, Boost, Chromium, PowerShell) and prunes stale docker state. Don't remove these steps without testing on a fresh runner.
- **Concurrency:** 2 runners. Jobs in the same workflow run can fan out to both; jobs in *different* workflow runs are serialized by gitea's queue. The `concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false }` setting keeps tag pushes from racing each other but allows per-PR/per-branch parallelism.
- **Workflow visibility in UI:** gitea Actions only surfaces workflows from the **default branch** in the web UI's workflow list, even for `workflow_dispatch` triggers. Workflows on feature branches are invisible until merged to `main`.
- **Artifact actions quirk:** `actions/{upload,download}-artifact@v4+` do not work on Gitea (they depend on a GitHub-only Artifact API). Stick to `@v3` if matrix-fanout-with-artifacts is ever needed. We avoided this by using `docker/build-push-action@v7` with comma-separated `platforms: linux/amd64,linux/arm64`, which natively does a multi-arch push in a single job, no artifact dance.

## Migration plan: split-base → production

1. **Validate the split-base dispatch.** Trigger `docker-publish-split.yml` manually with `release_tag=v0.0.0-split-test` and `promote_latest=false`.
   Confirm all jobs go green, image sizes match the production baseline within ~10%, and no unexpected layer rebuilds appear in `build-variant-*` logs after the FROM line.
2. **Run a second dispatch** to confirm cache-hit behavior: `base-decide` should set `need_build=false`, `build-base` should be skipped entirely, total wall clock should drop to ~25–40 min.
3. **Cut over** — *done as of v1.14.50.* `docker-publish-split.yml` now triggers on `push: tags: v*`. `docker-publish.yml` and the original `Dockerfile` have been deleted.
4. **Tag a release.** First production release on the new pipeline.

## Related docs

- [`AGENTS.md`](../AGENTS.md) — domain facts, release-day checklist, documentation coupling rules. Read first when modifying CI behavior.
- [`CHANGELOG.md`](../CHANGELOG.md) — build pipeline rewrite landed in v1.14.50.
- `Dockerfile.base`, `Dockerfile.variant` — the split-base Dockerfiles. Comments at the top of each explain their role.
- [`scripts/smoke-test.sh`](../scripts/smoke-test.sh) — invoked by both workflows in this directory; this is the single source of truth for "what does a built image have to satisfy".
- [`scripts/generate-dockerhub-md.py`](../scripts/generate-dockerhub-md.py) — generates `DOCKER_HUB.md` from `HUB_TEMPLATE`. `--check` enforces sync in `validate.yml`.
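
For the "within ~10%" image-size check in step 1 of the migration plan, a comparison along the following lines may help. It is a sketch under assumptions: `within_pct` is a hypothetical helper, sizes are plain byte counts, and the document does not say whether the author compared compressed registry sizes or local unpacked sizes.

```sh
#!/bin/sh
# Sketch only: within_pct is a hypothetical helper, not part of the repository.
# Succeeds when candidate differs from baseline by at most pct percent.
# All three arguments are non-negative integers; sizes are in bytes.
within_pct() {
    baseline=$1 candidate=$2 pct=$3
    if [ "$candidate" -ge "$baseline" ]; then
        diff=$((candidate - baseline))
    else
        diff=$((baseline - candidate))
    fi
    # Integer form of: diff / baseline <= pct / 100
    [ $((diff * 100)) -le $((baseline * pct)) ]
}

# Example (not run here): compare two locally pulled images.
# OLD_IMAGE and NEW_IMAGE are placeholders for real image references.
#   old=$(docker image inspect --format '{{.Size}}' "$OLD_IMAGE")
#   new=$(docker image inspect --format '{{.Size}}' "$NEW_IMAGE")
#   within_pct "$old" "$new" 10 || echo "size drifted more than 10%" >&2
```

Integer arithmetic avoids a dependency on `bc` or `awk`; for multi-gigabyte images the intermediate products still fit in 64-bit shell arithmetic.
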