# CI / Build Pipeline

This directory contains the gitea Actions workflows and the supporting
documentation for opencode-devbox's CI. If you're investigating *why*
the build pipeline is shaped the way it is, you're in the right place.

## Workflows in this directory

| File | Trigger | Role |
|---|---|---|
| [`workflows/docker-publish-split.yml`](workflows/docker-publish-split.yml) | `push: tags: v*` | **Production release pipeline.** Two-phase split-base build: shared `base-<hash>` published once (skipped on cache hit), then five parallel variant deltas. ~40–80 min wall clock depending on runner count and whether base needs rebuilding. |
| [`workflows/validate.yml`](workflows/validate.yml) | `push: branches: main` + PR | **Lightweight gate.** amd64-only smoke test of all five variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. |

## Why the split-base pipeline exists

opencode-devbox publishes **five image variants** (`base`, `omos`, `with-pi`, `omos-with-pi`, `pi-only`) × **two architectures** (amd64, arm64) = **ten image tags per release**. The builds run on two self-hosted gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

The five variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original `Dockerfile` was a single multi-stage build with `INSTALL_*` build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated `RUN` produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.

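The cascade can be illustrated with a toy model. This is an assumption made for illustration, not BuildKit's actual key derivation: each layer's key is taken to be a hash of its parent's key plus its own command string. The helper names `layer_key` and `chain_keys`, and the command strings, are hypothetical.

```sh
# Toy model of parent-chained layer cache keys (NOT BuildKit's real algorithm).
# Each key is sha256(parent_key + command), so one differing command changes
# every key that follows it.

layer_key() {  # usage: layer_key <parent_key> <command>
  printf '%s\n%s' "$1" "$2" | sha256sum | cut -c1-12
}

chain_keys() {  # usage: chain_keys <command>...; prints one key per layer
  parent="scratch"
  for cmd in "$@"; do
    parent=$(layer_key "$parent" "$cmd")
    echo "$parent"
  done
}

# Two variants: identical commands except for the build-arg-gated second layer.
variant_a=$(chain_keys "apt-get install git" "INSTALL_OMOS=false" "install aws-cli")
variant_b=$(chain_keys "apt-get install git" "INSTALL_OMOS=true" "install aws-cli")

printf 'variant A:\n%s\nvariant B:\n%s\n' "$variant_a" "$variant_b"
```

In this model the first key matches across both variants, while the second and third differ, even though the third command is byte-identical. That is the reuse the split-base design recovers by moving shared layers into a separately published image.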
Two improvements were considered:

1. **Reorder the original Dockerfile** so all variant-gated RUNs land at the bottom: modest gain, ~10–20% wall-clock reduction. *Not pursued.*
2. **Split into `Dockerfile.base` + `Dockerfile.variant`** with the base published as a long-lived shared image: significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. *Pursued.*

The split-base architecture is what the `docker-publish-split.yml` workflow exercises.

## How the split-base pipeline works

```
                    ┌──────────────────┐
                    │   base-decide    │   compute base-<hash>;
                    │                  │   probe Docker Hub.
                    │   hash inputs:   │   (resolve-versions
                    │   Dockerfile.base│    runs in parallel:
                    │   rootfs/        │    npm view pi/omos
                    │   entrypoint*.sh │    → concrete versions)
                    └────────┬─────────┘
                             │
               ┌─────────────┴─────────────┐
               │   need_build = true?      │
               └─────────────┬─────────────┘
                       yes   │   no
                             ▼
                    ┌──────────────────┐
                    │   build-base     │   multi-arch build,
                    │                  │   push base-<hash>
                    └────────┬─────────┘   to Docker Hub.
                             │
     ┌───────────────────────┼───────────────────────┐
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│smoke-base│            │smoke-omos│    ...   │smoke-omos-pi │   amd64 only,
└────┬─────┘            └────┬─────┘          └──────┬───────┘   parallel.
     │                       │                       │
     ▼                       ▼                       ▼
┌──────────┐            ┌──────────┐          ┌──────────────┐
│build-    │            │build-    │          │build-        │   multi-arch,
│variant-  │            │variant-  │    ...   │variant-      │   parallel,
│base      │            │omos      │          │omos-with-pi  │   tag push.
└────┬─────┘            └────┬─────┘          └──────┬───────┘
     └───────────────────────┴───────────────────────┘
                             │
                             ▼
                ┌──────────────────────────┐
                │   promote-base-latest    │   crane copy
                │                          │   base-<hash>
                │                          │   → base-latest
                └────────┬─────────────────┘
                         │
                         ▼
                ┌──────────────────────────┐
                │   update-description     │
                └──────────────────────────┘
```

### Step 1: `base-decide` (and `resolve-versions` in parallel)

**`base-decide`** computes a SHA-256 hash over the inputs that determine
the base image's content:

```sh
{
  cat Dockerfile.base
  find rootfs -type f \
    ! -path '*/__pycache__/*' \
    ! -name '*.pyc' \
    ! -name '.DS_Store' \
    ! -name '._*' \
    -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12
```

Junk filters keep the local recompute reproducible against CI's clean
checkout: `__pycache__/*.pyc` and macOS metadata files (`.DS_Store`,
AppleDouble `._*`) are gitignored but still walked by `find -type f`.

The 12-character truncated hash becomes `base-<hash>`. The job then probes
Docker Hub for this tag via `docker manifest inspect`:

- If it exists → set `need_build=false`. `build-base` is skipped entirely.
- If it doesn't → set `need_build=true`. `build-base` runs.

This is the core cache-reuse mechanism. Version-bump-only releases
(only `Dockerfile.variant` or build-args changed) hit the cache. Releases
that change anything in the base (apt packages, AWS CLI, Node version,
locale list, entrypoint scripts) pay the full base-build cost once.

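The decision itself can be sketched as a small function. This is a minimal sketch, not the workflow's actual step: `decide_base` and the `PROBE_CMD` override are hypothetical names, and `PROBE_CMD` exists only so the logic can be exercised without network access.

```sh
# Sketch of the decide step. PROBE_CMD is a hypothetical override hook; it
# defaults to the `docker manifest inspect` probe.
PROBE_CMD=${PROBE_CMD:-docker manifest inspect}

decide_base() {  # usage: decide_base <repo> <hash>; prints base_tag and need_build
  repo=$1
  tag="base-$2"
  # Unquoted on purpose: PROBE_CMD may hold a command plus arguments.
  if $PROBE_CMD "$repo:$tag" >/dev/null 2>&1; then
    need_build=false   # tag already published: build-base can be skipped
  else
    need_build=true    # tag missing: build-base must run
  fi
  echo "base_tag=$tag"
  echo "need_build=$need_build"
}
```

For example, `decide_base joakimp/opencode-devbox "$hash"` prints two `key=value` lines that a workflow step could forward as job outputs. How the real job exports them is not shown here.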
**`resolve-versions`** runs alongside `base-decide` (no `needs:`
dependency between them) and resolves the floating npm packages whose
`*_VERSION` build-args default to `latest`:

```sh
PI_VERSION=$(npm view @earendil-works/pi-coding-agent version)
OMOS_VERSION=$(npm view oh-my-opencode-slim version)
```

The outputs (`pi_version`, `omos_version`) are consumed by every variant
smoke and build job that installs pi or omos. **Why this exists:** without
it, the `npm install -g` RUN layer in `Dockerfile.variant` hashes
identically across builds (same ARG default, same command string), so
the registry buildcache silently reuses the layer from whatever upstream
version was current when the cache was first populated. This is the
cache-hit silent-regression class of bug that shipped pi-devbox v0.74.0
through v0.75.5 with identical image bytes (fixed in pi-devbox v0.75.5b
on 2026-05-23). It is currently masked here by `OPENCODE_VERSION` bumping
every release (parent-chain cache-key invalidation), but the masking would
fail on a `vN.N.Nb` opencode-version-unchanged release that only bumps pi
or omos. Smoke jobs additionally assert `EXPECTED_PI_VERSION` /
`EXPECTED_OMOS_VERSION` against the resolved values.

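The smoke-side assertion amounts to a string comparison between the resolved version and what the image actually contains. A minimal sketch, assuming a hypothetical helper named `assert_version`; how `scripts/smoke-test.sh` really reads the installed version is not shown here:

```sh
# Hypothetical helper: fail loudly when the installed version differs from the
# version resolved at the start of the run.
assert_version() {  # usage: assert_version <name> <expected> <actual>
  if [ "$2" != "$3" ]; then
    echo "FAIL: $1 is $3, expected $2 (stale cached layer?)" >&2
    return 1
  fi
  echo "ok: $1 $3"
}
```

A call such as `assert_version pi "$EXPECTED_PI_VERSION" "$installed_pi_version"` would then turn a silently reused layer into a red smoke job instead of a published image. `installed_pi_version` is a placeholder for whatever the smoke test extracts from the container.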
### Step 2: `build-base` (conditional)

Only runs when `need_build=true`. Multi-arch (amd64 + arm64) build of
`Dockerfile.base`, pushed to `joakimp/opencode-devbox:base-<hash>`.
Registry cache via `--cache-from/--cache-to` reduces incremental rebuilds
when only one or two layers changed.

The base image is **not** tagged `base-latest` here; that promotion
happens at the very end after all variants succeed (see step 5).

### Step 3: `smoke-*` (×5, parallel)

For each variant: build amd64-only against the base tag, load into
local docker, run [`scripts/smoke-test.sh`](../scripts/smoke-test.sh).
Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
|---|---|---|---|
| `base` | true | false | false |
| `omos` | true | true | false |
| `with-pi` | true | false | true |
| `omos-with-pi` | true | true | true |
| `pi-only` | false | false | true |

Smoke runs with `--variant <name>` to enable variant-specific assertions.
Smoke gates the publish: a smoke failure for variant X blocks `build-variant-X`.

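The variant-to-build-arg mapping can be written as a lookup. This is a sketch with a hypothetical function name, not code from the workflow. The `pi-only` entry reflects the fifth variant (no opencode, pi installed); its `INSTALL_OMOS=false` value is an assumption.

```sh
# Map a variant name to its INSTALL_* build-args.
variant_build_args() {  # usage: variant_build_args <variant>
  case "$1" in
    base)         set -- true  false false ;;
    omos)         set -- true  true  false ;;
    with-pi)      set -- true  false true  ;;
    omos-with-pi) set -- true  true  true  ;;
    pi-only)      set -- false false true  ;;  # INSTALL_OMOS=false is assumed
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  # After `set --`, $1/$2/$3 hold the opencode/omos/pi flags.
  echo "--build-arg INSTALL_OPENCODE=$1 --build-arg INSTALL_OMOS=$2 --build-arg INSTALL_PI=$3"
}
```

Keeping the mapping in one place makes it harder for a smoke job and its matching build job to drift apart on which components a variant installs.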
### Step 4: `build-variant-*` (×5, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of
`Dockerfile.variant`, pushed to Docker Hub with the user-facing release
tags:

| Build job | Tags pushed |
|---|---|
| `build-variant-base` | `vX.Y.Z`, `latest` |
| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` |
| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` |
| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` |
| `build-variant-pi-only` | `vX.Y.Z-pi-only`, `latest-pi-only` |

The `latest*` aliases are only updated when `promote_latest=true` (the
manual dispatch input); for test runs, `promote_latest=false` keeps the
production aliases pointing at the previous good release.

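The tag naming follows one rule: the `base` variant carries no suffix, every other variant appends `-<variant>`, and the `latest*` alias is emitted only when promotion is enabled. A minimal sketch, with `variant_tags` as a hypothetical name:

```sh
# Derive the tags a build-variant-* job would push. The function name and
# argument order are illustrative.
variant_tags() {  # usage: variant_tags <release_tag> <variant> <promote_latest>
  suffix=""
  [ "$2" = "base" ] || suffix="-$2"   # the base variant carries no suffix
  echo "$1$suffix"
  if [ "$3" = "true" ]; then
    echo "latest$suffix"              # alias only moves when promotion is on
  fi
}
```

With `promote_latest=false`, a call like `variant_tags v0.0.0-split-test omos false` yields only the versioned tag, which is what keeps a test dispatch from touching the production aliases.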
### Step 5: `promote-base-latest`

Once all five variants successfully publish, re-tag `base-<hash>` as
`base-latest` using `crane copy`. This is a **manifest-level re-tag, not
a rebuild**: it touches only Docker Hub's image index, takes seconds,
and is atomic.

The reason this happens *after* variants succeed (rather than alongside
`build-base`) is so a partial failure leaves `base-latest` pointing at
the previous known-good base. External consumers who pin to
`base-latest` (e.g. the planned pi-devbox repo) never see a broken base.

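As a sketch, the promotion reduces to a single `crane copy` between two tags of the same repository. The wrapper below is illustrative: `promote_base` and the `DRY_RUN` switch are hypothetical names, and registry authentication is assumed to have been set up earlier in the job.

```sh
# Sketch of the promotion step. With DRY_RUN=1 the command is printed instead
# of executed.
promote_base() {  # usage: promote_base <repo> <hash>
  src="$1:base-$2"
  dst="$1:base-latest"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "crane copy $src $dst"
  else
    crane copy "$src" "$dst"   # registry-side copy; no layers are rebuilt
  fi
}
```

Running `DRY_RUN=1 promote_base joakimp/opencode-devbox "$hash"` shows the exact command before anything is re-tagged.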
### Step 6: `update-description`

Push the generated `DOCKER_HUB.md` to the Hub repo's `full_description`
field via the Hub REST API. This step is carried over unchanged from the
original production pipeline.

## NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

```
ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global
```

This is intentional: it makes `pi install npm:<pkg>` and `npm install -g`
land on the `devbox-pi-config` named volume at runtime, so user-installed
packages survive container recreate AND image rebuild.

But the *variant build* inherits this prefix at build time. If left as-is,
`npm install -g opencode-ai@$VERSION` in `Dockerfile.variant` would
install opencode into `/home/developer/.pi/npm-global/...`, which is then
**shadowed by the volume mount at runtime** → opencode disappears from
PATH on first start.

Fix: each `npm install -g` in `Dockerfile.variant` overrides the prefix
per-RUN:

```dockerfile
RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}
```

Baked binaries land on `/usr/bin/...` (system prefix) and survive the volume
mount. Runtime-installed user packages still land on
`~/.pi/npm-global/...`. Both are visible on PATH.

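The override works because a `VAR=value` prefix on a shell command applies to that one command only and leaves the inherited value untouched for everything after it. The snippet below demonstrates the scoping with plain `sh`, standing in for the `RUN` line; no npm is involved.

```sh
# A VAR=value prefix applies to that one command only; the exported value is
# unchanged afterwards. The paths mirror the ones used in the images.
export NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

during=$(NPM_CONFIG_PREFIX=/usr sh -c 'echo "$NPM_CONFIG_PREFIX"')
after=$(sh -c 'echo "$NPM_CONFIG_PREFIX"')

echo "during override: $during"
echo "after override:  $after"
```

The first line reports `/usr` and the second reports `/home/developer/.pi/npm-global`, which is why the base image's `ENV` default stays in force at runtime even though every baked install overrode it.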
## Cache strategy

Two registry caches are configured:

```yaml
cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max
```

`mode=max` exports cache for *all* layers, not just the final image's
layers. This is important for multi-arch builds, where cross-arch layer
reuse matters more.

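For reproducing a build outside CI, the same settings map onto the `--cache-from` / `--cache-to` flags of `docker buildx build`. The helper below only assembles the flag string. `cache_flags` is a hypothetical name, and writing to the cache ref requires push access to the repository.

```sh
# Build the buildx cache flags for one registry cache ref.
cache_flags() {  # usage: cache_flags <cache_ref>
  echo "--cache-from type=registry,ref=$1 --cache-to type=registry,ref=$1,mode=max"
}
```

A local invocation might look like `docker buildx build $(cache_flags joakimp/opencode-devbox:base-buildcache) -f Dockerfile.base .`, left unquoted so the flags split into separate arguments. Dropping the `--cache-to` half gives a read-only cache that is safe to use without credentials.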
## Wall-clock estimates

| Scenario | Production pipeline | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | **~30–40 min** (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | **~70–90 min** (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases (which are
infrequent: a few times a year for Debian / Node major version bumps).
Most releases are version-bumps and ride the cache.

## Validate workflow

[`validate.yml`](workflows/validate.yml) is the lightweight gate that runs
on every push to `main` and on PRs. It:

1. Runs `scripts/generate-dockerhub-md.py --check` to enforce
   `DOCKER_HUB.md` is in sync with `HUB_TEMPLATE`.
2. Builds each of the five variants amd64-only (no multi-arch, no push)
   and runs `scripts/smoke-test.sh`.

This catches regressions before they reach a tag push. Wall clock ~30 min.

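To approximate the gate locally, the per-variant smoke runs can be looped in one script. This is a sketch under assumptions: in CI each variant is its own job, the image must already be built, and `run_gate` and `SMOKE_CMD` are hypothetical names. Any arguments `scripts/smoke-test.sh` needs beyond `--variant` are not shown.

```sh
# Sketch of a local gate loop. SMOKE_CMD is a hypothetical hook that defaults
# to the repo's smoke test script.
SMOKE_CMD=${SMOKE_CMD:-scripts/smoke-test.sh}

run_gate() {  # usage: run_gate <variant>...
  failed=""
  for variant in "$@"; do
    if $SMOKE_CMD --variant "$variant"; then
      echo "PASS $variant"
    else
      echo "FAIL $variant"
      failed="$failed $variant"
    fi
  done
  [ -z "$failed" ]   # non-zero exit status when any variant failed
}
```

Unlike a fail-fast loop, this runs every variant before reporting, so a single broken variant does not hide failures in the others.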
## Runner expectations

- **Image:** `catthehacker/ubuntu:act-latest`. Each job runs inside a
  fresh container of this image. Don't assume any pre-installed
  toolchains beyond what catthehacker ships.
- **Disk pressure:** the runner host has ~40 GB of usable overlay space,
  often 70%+ used at job start. Every job that does `load: true` (smoke)
  starts with a `Reclaim runner disk` step that strips
  catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM,
  Boost, Chromium, PowerShell) and prunes stale docker state. Don't
  remove these steps without testing on a fresh runner.
- **Concurrency:** 2 runners. Jobs in the same workflow run can fan out to
  both; jobs in *different* workflow runs are serialized by gitea's queue.
  The `concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false }`
  setting keeps tag pushes from racing each other but allows
  per-PR/per-branch parallelism.
- **Workflow visibility in UI:** gitea Actions only surfaces workflows
  from the **default branch** in the web UI's workflow list, even for
  `workflow_dispatch` triggers. Workflows on feature branches are
  invisible until merged to `main`.
- **Artifact actions quirk:** `actions/{upload,download}-artifact@v4+` does
  not work on Gitea (depends on a GitHub-only Artifact API). Stick to
  `@v3` if matrix-fanout-with-artifacts is ever needed. We avoided this
  by using `docker/build-push-action@v7` with comma-separated
  `platforms: linux/amd64,linux/arm64`, which natively does a multi-arch
  push in a single job with no artifact dance.

## Migration plan: split-base → production

1. **Validate the split-base dispatch.** Trigger
   `docker-publish-split.yml` manually with `release_tag=v0.0.0-split-test`
   and `promote_latest=false`. Confirm all jobs go green, image sizes
   match the production baseline within ~10%, and no unexpected layer
   rebuilds appear in `build-variant-*` logs after the FROM line.
2. **Run a second dispatch** to confirm cache-hit behavior:
   `base-decide` should set `need_build=false`, `build-base` should be
   skipped entirely, total wall clock should drop to ~25–40 min.
3. **Cut over** (*done as of v1.14.50*). `docker-publish-split.yml` now
   triggers on `push: tags: v*`. `docker-publish.yml` and the original
   `Dockerfile` have been deleted.
4. **Tag a release.** First production release on the new pipeline.

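The "within ~10%" size check from step 1 is easy to get wrong by eye, so it can be scripted. A minimal sketch using integer arithmetic; `within_pct` is a hypothetical name, and the sizes can be in any unit as long as both use the same one:

```sh
# Integer check: is <size> within <pct> percent of <baseline>?
within_pct() {  # usage: within_pct <size> <baseline> <pct>
  diff=$(( $1 - $2 ))
  [ "$diff" -ge 0 ] || diff=$(( 0 - diff ))   # absolute difference
  # Compare diff/baseline <= pct/100 without floating point.
  [ $(( diff * 100 )) -le $(( $2 * $3 )) ]
}
```

For instance, `within_pct "$new_mb" "$baseline_mb" 10 || echo "size drifted"` flags an image that grew or shrank by more than a tenth. The variable names are placeholders for however the sizes are collected.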
## Related docs

- [`AGENTS.md`](../AGENTS.md): domain facts, release-day checklist,
  documentation coupling rules. Read first when modifying CI behavior.
- [`CHANGELOG.md`](../CHANGELOG.md): build pipeline rewrite landed in v1.14.50.
- `Dockerfile.base`, `Dockerfile.variant`: the split-base Dockerfiles.
  Comments at the top of each explain their role.
- [`scripts/smoke-test.sh`](../scripts/smoke-test.sh): invoked by both
  workflows; this is the single source of truth for "what does a
  built image have to satisfy".
- [`scripts/generate-dockerhub-md.py`](../scripts/generate-dockerhub-md.py):
  generates `DOCKER_HUB.md` from `HUB_TEMPLATE`. `--check` enforces
  sync in `validate.yml`.