opencode-devbox/.gitea/README.md
joakimp 6fde27c212
Document the build pipeline architecture in .gitea/README.md
The split-base build architecture, the NPM_CONFIG_PREFIX gotcha, the
hash-driven base cache reuse mechanism, and the cutover plan from
docker-publish.yml to docker-publish-split.yml were previously
scattered across:
  - inline Dockerfile.base / Dockerfile.variant comments
  - CHANGELOG Unreleased entries
  - AGENTS.md mentions
  - docker-publish-split.yml header comment
  - my own session notes

Consolidate into .gitea/README.md as the canonical architectural doc.
Gitea (like GitHub) auto-renders this when navigating to .gitea/ in
the web UI, so anyone investigating 'why is CI shaped this way?'
finds it on the first click. Cross-referenced from AGENTS.md as the
first thing to read when touching CI.

Covers:
  - The two release pipelines and why both exist
  - Why split-base: cross-variant cache misses on layer-hash-divergence
  - The 6 phases of the split-base pipeline with an ASCII diagram
  - base-decide hash inputs and Docker Hub probe logic
  - NPM_CONFIG_PREFIX variant-override pattern (the volume-shadow trap)
  - Registry cache strategy (mode=max for cross-arch reuse)
  - Wall-clock estimates: version-bump vs base-touching releases
  - Validate workflow role
  - Runner expectations: catthehacker image, disk reclaim, concurrency,
    Gitea Actions @v4 artifact incompatibility
  - 4-step migration plan from docker-publish.yml to .split.yml
  - Cross-refs to related docs

Does not duplicate AGENTS.md content; links to it for domain facts and
release-day checklist.
2026-05-09 19:28:03 +02:00

# CI / Build Pipeline
This directory contains the Gitea Actions workflows and the supporting
documentation for opencode-devbox's CI. If you're investigating *why*
the build pipeline is shaped the way it is, you're in the right place.
## Workflows in this directory
| File | Trigger | Role |
|---|---|---|
| [`workflows/docker-publish.yml`](workflows/docker-publish.yml) | `push: tags: v*` | **Production release pipeline.** Multi-arch build of all four variants (`base`, `omos`, `with-pi`, `omos-with-pi`), publish to Docker Hub, update Hub description. ~165–180 min wall clock. |
| [`workflows/docker-publish-split.yml`](workflows/docker-publish-split.yml) | `workflow_dispatch` (manual) | **Experimental split-base pipeline.** Two-phase build: shared `base-<hash>` published once, then four thin variant deltas. Estimated ~30–40 min on cache hit, ~70–90 min when base needs rebuilding. Not yet validated end-to-end; once 1–2 dispatch test runs prove it, this will take over `on: push: tags: v*` and `docker-publish.yml` will be retired. |
| [`workflows/validate.yml`](workflows/validate.yml) | `push: branches: main` + PR | **Lightweight gate.** amd64-only smoke test of all four variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. |
## Why two release pipelines exist
opencode-devbox publishes **four image variants** (`base`, `omos`, `with-pi`, `omos-with-pi`) × **two architectures** (amd64, arm64) = **eight image tags per release**. CI currently runs on two self-hosted Gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original `Dockerfile` was a single multi-stage build with `INSTALL_*` build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated `RUN` produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.

Two improvements were considered:
1. **Reorder the original Dockerfile** so all variant-gated RUNs land at the bottom: modest gain, ~10–20% wall-clock reduction. *Not pursued.*
2. **Split into `Dockerfile.base` + `Dockerfile.variant`** with the base published as a long-lived shared image: significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. *Pursued.*

The split-base architecture is what the `docker-publish-split.yml` workflow exercises.
## How the split-base pipeline works
```
                    ┌──────────────────┐
                    │   base-decide    │  compute base-<hash>;
                    │                  │  probe Docker Hub.
                    │ hash inputs:     │
                    │   Dockerfile.base│
                    │   rootfs/        │
                    │   entrypoint*.sh │
                    └────────┬─────────┘
               ┌─────────────┴─────────────┐
               │    need_build = true?     │
               └─────────────┬─────────────┘
                        yes  │  no
                    ┌──────────────────┐
                    │    build-base    │  multi-arch build,
                    │                  │  push base-<hash>
                    └────────┬─────────┘  to Docker Hub.
     ┌───────────────────────┼───────────────────────┐
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│smoke-base│            │smoke-omos│   ...    │smoke-omos-pi │  amd64 only,
└────┬─────┘            └────┬─────┘          └──────┬───────┘  parallel.
     │                       │                       │
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│build-    │            │build-    │          │build-        │  multi-arch,
│variant-  │            │variant-  │   ...    │variant-      │  parallel,
│base      │            │omos      │          │omos-with-pi  │  tag push.
└────┬─────┘            └────┬─────┘          └──────┬───────┘
     └───────────────────────┴───────────────────────┘
                    ┌──────────────────────────┐
                    │  promote-base-latest     │  crane copy
                    │                          │  base-<hash>
                    │                          │  → base-latest
                    └────────┬─────────────────┘
                    ┌──────────────────────────┐
                    │  update-description      │
                    └──────────────────────────┘
```
### Step 1: `base-decide`
Compute a SHA-256 hash over the inputs that determine the base image's
content:
```sh
{
    cat Dockerfile.base
    find rootfs -type f -print0 | sort -z | xargs -0 cat
    cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12
```
The 12-character truncated hash becomes `base-<hash>`. Probe Docker Hub
for this tag via `docker manifest inspect`:
- If it exists → set `need_build=false`. `build-base` is skipped entirely.
- If it doesn't → set `need_build=true`. `build-base` runs.

This is the core cache-reuse mechanism. Version-bump-only releases
(only `Dockerfile.variant` or build-args changed) hit the cache. Releases
that change anything in the base — apt packages, AWS CLI, Node version,
locale list, entrypoint scripts — pay the full base-build cost once.
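
The hash-and-probe decision described in this step can be sketched as two small shell functions. This is a minimal sketch, not the workflow's actual code: the function names `compute_base_tag` and `decide_need_build` are hypothetical, and the repository name is taken from elsewhere in this README.

```sh
#!/bin/sh
# Sketch of base-decide. Function names are hypothetical; the hash
# pipeline mirrors the snippet shown earlier in this step.
IMAGE_REPO="joakimp/opencode-devbox"

# Print "base-<12 hex chars>" for the base inputs found under directory $1.
compute_base_tag() {
    (
        cd "$1" || exit 1
        {
            cat Dockerfile.base
            find rootfs -type f -print0 | sort -z | xargs -0 cat
            cat entrypoint.sh entrypoint-user.sh
        } | sha256sum | cut -c1-12 | sed 's/^/base-/'
    )
}

# Print need_build=false if tag $1 already exists in the registry,
# need_build=true otherwise.
decide_need_build() {
    if docker manifest inspect "$IMAGE_REPO:$1" >/dev/null 2>&1; then
        echo "need_build=false"
    else
        echo "need_build=true"
    fi
}
```

In a workflow step, something like `decide_need_build "$(compute_base_tag .)" >> "$GITHUB_OUTPUT"` would expose the result to later jobs; whether the real workflow wires it this way is an assumption.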
### Step 2: `build-base` (conditional)
Only runs when `need_build=true`. Multi-arch (amd64 + arm64) build of
`Dockerfile.base`, pushed to `joakimp/opencode-devbox:base-<hash>`.
Registry cache via `--cache-from/--cache-to` reduces incremental rebuilds
when only one or two layers changed.
The base image is **not** tagged `base-latest` here — that promotion
happens at the very end after all variants succeed (see step 5).
### Step 3: `smoke-*` (×4, parallel)
For each variant: build amd64-only against the base tag, load into
local docker, run [`scripts/smoke-test.sh`](../scripts/smoke-test.sh).
Variant build-args:
| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
|---|---|---|---|
| `base` | true | false | false |
| `omos` | true | true | false |
| `with-pi` | true | false | true |
| `omos-with-pi` | true | true | true |

The smoke test runs with `--variant <name>` to enable variant-specific assertions.
It gates the publish: a smoke failure for variant X blocks `build-variant-X`.
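
The build-arg table above can be expressed as a lookup function, which is handy when reproducing a smoke build locally. A minimal sketch: the helper name `variant_build_args` is hypothetical, and how the workflow itself passes these arguments is not shown here.

```sh
#!/bin/sh
# Print the --build-arg flags for one variant, following the table above.
# variant_build_args is a hypothetical helper name.
variant_build_args() {
    case "$1" in
        base)         omos=false; pi=false ;;
        omos)         omos=true;  pi=false ;;
        with-pi)      omos=false; pi=true  ;;
        omos-with-pi) omos=true;  pi=true  ;;
        *) echo "unknown variant: $1" >&2; return 1 ;;
    esac
    echo "--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=$omos --build-arg INSTALL_PI=$pi"
}
```

The output is meant to be word-split into a `docker build` command line, followed by `scripts/smoke-test.sh --variant <name>` against the loaded image.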
### Step 4: `build-variant-*` (×4, parallel)
For each variant that passed smoke: multi-arch (amd64 + arm64) build of
`Dockerfile.variant`, pushed to Docker Hub with the user-facing release
tags:
| Build job | Tags pushed |
|---|---|
| `build-variant-base` | `vX.Y.Z`, `latest` |
| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` |
| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` |
| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` |

The `latest*` aliases are only updated when `promote_latest=true` (the
manual dispatch input). For test runs, `promote_latest=false` keeps the
production aliases pointing at the previous good release.
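
The tag scheme in the table, including the `promote_latest` switch, reduces to a suffix rule: the `base` variant has no suffix, every other variant appends `-<variant>`. A minimal sketch, with `release_tags` as a hypothetical helper name:

```sh
#!/bin/sh
# release_tags VERSION VARIANT PROMOTE_LATEST
# Print one tag per line. release_tags is a hypothetical helper name.
release_tags() {
    version=$1
    variant=$2
    promote_latest=$3
    if [ "$variant" = "base" ]; then
        suffix=""
    else
        suffix="-$variant"
    fi
    echo "${version}${suffix}"
    # The latest* alias is only emitted for production runs.
    if [ "$promote_latest" = "true" ]; then
        echo "latest${suffix}"
    fi
}
```

For example, `release_tags v1.2.3 omos true` prints `v1.2.3-omos` and `latest-omos`, while passing `false` prints only the versioned tag.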
### Step 5: `promote-base-latest`
Once all four variants successfully publish, re-tag `base-<hash>` as
`base-latest` using `crane copy`. This is a **manifest-level re-tag, not
a rebuild** — it touches only Docker Hub's image index, takes seconds,
and is atomic.
The reason this happens *after* variants succeed (rather than alongside
`build-base`) is so a partial failure leaves `base-latest` pointing at
the previous known-good base. External consumers who pin to
`base-latest` (e.g. the planned pi-devbox repo) never see a broken base.
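
The promote-after-success ordering can be sketched as a guard around `crane copy`. This is an illustration under assumptions: the function name `promote_base_latest` is hypothetical, and the result strings follow the `success` / `failure` values that Actions reports for job results; the real workflow may express the gate through job dependencies instead.

```sh
#!/bin/sh
IMAGE_REPO="joakimp/opencode-devbox"

# promote_base_latest BASE_TAG RESULT...
# Re-tag base-<hash> as base-latest only if every variant result is "success".
# promote_base_latest is a hypothetical helper name.
promote_base_latest() {
    base_tag=$1
    shift
    for result in "$@"; do
        if [ "$result" != "success" ]; then
            echo "skipping promote: a variant reported '$result'" >&2
            return 1
        fi
    done
    # Registry-side copy of the manifest; no layers are rebuilt.
    crane copy "$IMAGE_REPO:$base_tag" "$IMAGE_REPO:base-latest"
}
```

If any variant fails, the function returns before touching the registry, so `base-latest` keeps pointing at the previous known-good base.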
### Step 6: `update-description`
Push the generated `DOCKER_HUB.md` to the Hub repo's `full_description`
field via the Hub REST API. Same step as the production pipeline.
## NPM_CONFIG_PREFIX gotcha (variant override pattern)
The base sets
```dockerfile
ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global
```
This is intentional — it makes `pi install npm:<pkg>` and `npm install -g`
land on the `devbox-pi-config` named volume at runtime, so user-installed
packages survive container recreate AND image rebuild.
But the *variant build* inherits this prefix at build time. If left as-is,
`npm install -g opencode-ai@$VERSION` in `Dockerfile.variant` would
install opencode into `/home/developer/.pi/npm-global/...`, which is then
**shadowed by the volume mount at runtime** → opencode disappears from
PATH on first start.
Fix: each `npm install -g` in `Dockerfile.variant` overrides the prefix
per-RUN:
```dockerfile
RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}
```
Baked binaries land in `/usr/bin/...` (system prefix) and survive the volume
mount. Runtime-installed user packages still land in
`~/.pi/npm-global/...`. Both are visible on PATH.
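
The per-RUN override relies on ordinary shell semantics: a `VAR=value command` prefix changes the environment for that one command only. The sketch below demonstrates this outside Docker, using a child `sh` as a stand-in for `npm install -g`; the stand-in is an illustration and not part of the build.

```sh
#!/bin/sh
# The image-wide default, as set by ENV in the base image.
export NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

# Stand-in for `npm install -g`: report the prefix the child process sees.
show_prefix='printf "%s\n" "$NPM_CONFIG_PREFIX"'

# The override applies to this single command only.
build_time=$(NPM_CONFIG_PREFIX=/usr sh -c "$show_prefix")
# The exported default is unchanged for every later command.
run_time=$(sh -c "$show_prefix")

echo "build-time prefix: $build_time"
echo "runtime prefix:    $run_time"
```

The first line reports `/usr` and the second reports the volume-backed default, which is why baked and user-installed packages end up in different places.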
## Cache strategy
Two registry caches are configured:
```yaml
cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max
cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max
```
`mode=max` exports cache for *all* layers, not just the final image's
layers. Important for multi-arch builds where the cross-arch layer reuse
matters more.
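
For reproducing a build outside the workflow, the same cache settings map onto the `--cache-from` and `--cache-to` flags of `docker buildx build`. A minimal sketch that only assembles the flags for the two cache refs listed above; `cache_flags` is a hypothetical helper name.

```sh
#!/bin/sh
IMAGE_REPO="joakimp/opencode-devbox"

# cache_flags CACHE_TAG
# Print the buildx registry-cache flags for one cache ref.
# cache_flags is a hypothetical helper name.
cache_flags() {
    ref="$IMAGE_REPO:$1"
    echo "--cache-from type=registry,ref=$ref --cache-to type=registry,ref=$ref,mode=max"
}

cache_flags base-buildcache
cache_flags base-variant-buildcache
```

The printed flags would be appended to a `docker buildx build --platform linux/amd64,linux/arm64 ...` invocation.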
## Wall-clock estimates
| Scenario | Production pipeline | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | **~30–40 min** (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | **~70–90 min** (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases, which are
infrequent: a few times a year for Debian / Node major version bumps.
Most releases are version-bumps and ride the cache.
## Validate workflow
[`validate.yml`](workflows/validate.yml) is the lightweight gate that runs
on every push to `main` and on PRs. It:
1. Runs `scripts/generate-dockerhub-md.py --check` to enforce
`DOCKER_HUB.md` is in sync with `HUB_TEMPLATE`.
2. Builds each of the four variants amd64-only (no multi-arch, no push)
and runs `scripts/smoke-test.sh`.

This catches regressions before they reach a tag push. Wall clock ~30 min.
## Runner expectations
- **Image:** `catthehacker/ubuntu:act-latest`. Each job runs inside a
fresh container of this image. Don't assume any pre-installed
toolchains beyond what catthehacker ships.
- **Disk pressure:** the runner host has ~40 GB of usable overlay space,
often 70%+ used at job start. Every job that does `load: true` (smoke)
starts with a `Reclaim runner disk` step that strips
catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM,
Boost, Chromium, PowerShell) and prunes stale docker state. Don't
remove these steps without testing on a fresh runner.
- **Concurrency:** 2 runners. Jobs in the same workflow run can fan out to
both; jobs in *different* workflow runs are serialized by Gitea's queue.
The `concurrency: { group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: false }`
setting keeps tag pushes from racing each other but allows
per-PR/per-branch parallelism.
- **Workflow visibility in UI:** Gitea Actions only surfaces workflows
from the **default branch** in the web UI's workflow list, even for
`workflow_dispatch` triggers. Workflows on feature branches are
invisible until merged to `main`.
- **Artifact actions quirk:** `actions/{upload,download}-artifact@v4+` does
not work on Gitea (it depends on a GitHub-only Artifact API). Stick to
`@v3` if matrix-fanout-with-artifacts is ever needed. We avoided this
by using `docker/build-push-action@v7` with comma-separated
`platforms: linux/amd64,linux/arm64`, which does a multi-arch push
natively in a single job, with no artifact dance.
## Migration plan: split-base → production
1. **Validate the split-base dispatch.** Trigger
`docker-publish-split.yml` manually with `release_tag=v0.0.0-split-test`
and `promote_latest=false`. Confirm all jobs go green, image sizes
match the production baseline within ~10%, and no unexpected layer
rebuilds appear in `build-variant-*` logs after the FROM line.
2. **Run a second dispatch** to confirm cache-hit behavior:
`base-decide` should set `need_build=false`, `build-base` should be
skipped entirely, total wall clock should drop to ~25–40 min.
3. **Cut over.** In a single commit:
- Edit `docker-publish-split.yml`: change `on: workflow_dispatch:` to
`on: push: tags: v*` and wire `$GITHUB_REF` into the `release_tag`
input, set `promote_latest=true` for production runs.
- Delete `docker-publish.yml`.
- Delete the original `Dockerfile` (keep `Dockerfile.base` +
`Dockerfile.variant`).
- Update `CHANGELOG.md`: promote the "Build pipeline" Unreleased entry.
4. **Tag a release.** First production release on the new pipeline. Watch
it like a hawk for the first run.
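
Two checks from the plan above lend themselves to small helpers: the ~10% image-size tolerance in step 1, and deriving the release tag from the pushed ref in step 3. A minimal sketch; both function names are hypothetical, and it assumes `GITHUB_REF` has the usual `refs/tags/<tag>` form on tag pushes.

```sh
#!/bin/sh
# release_tag_from_ref REF
# Turn "refs/tags/v1.2.3" into "v1.2.3"; fail on anything that is not a tag ref.
release_tag_from_ref() {
    case "$1" in
        refs/tags/*) echo "${1#refs/tags/}" ;;
        *) echo "not a tag ref: $1" >&2; return 1 ;;
    esac
}

# within_pct BASELINE CANDIDATE PCT
# Succeed if CANDIDATE differs from BASELINE by at most PCT percent.
# Integer arithmetic only, so sizes should be given in bytes.
within_pct() {
    baseline=$1
    candidate=$2
    pct=$3
    diff=$((candidate - baseline))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ $((diff * 100)) -le $((baseline * pct)) ]
}
```

For instance, `within_pct "$prod_size" "$split_size" 10` could gate step 1, where the two size variables are placeholders for values read from the registry or `docker image inspect`.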
## Related docs
- [`AGENTS.md`](../AGENTS.md) — domain facts, release-day checklist,
documentation coupling rules. Read first when modifying CI behavior.
- [`CHANGELOG.md`](../CHANGELOG.md) — the build pipeline rewrite is
recorded under `Unreleased` until the cutover lands.
- `Dockerfile`, `Dockerfile.base`, `Dockerfile.variant` — production
single-Dockerfile build and the split-base counterparts. Comments at
the top of each explain its role.
- [`scripts/smoke-test.sh`](../scripts/smoke-test.sh) — invoked by all
three workflows; this is the single source of truth for "what does a
built image have to satisfy".
- [`scripts/generate-dockerhub-md.py`](../scripts/generate-dockerhub-md.py)
— generates `DOCKER_HUB.md` from `HUB_TEMPLATE`. `--check` enforces
sync in `validate.yml`.