CI / Build Pipeline
This directory contains the gitea Actions workflows and the supporting documentation for opencode-devbox's CI. If you're investigating why the build pipeline is shaped the way it is, you're in the right place.
Workflows in this directory
| File | Trigger | Role |
|---|---|---|
| workflows/docker-publish-split.yml | push: tags: v* | Production release pipeline. Two-phase split-base build: shared base-<hash> published once (skipped on cache hit), then four parallel variant deltas. ~40–80 min wall clock depending on runner count and whether base needs rebuilding. |
| workflows/validate.yml | push: branches: main + PR | Lightweight gate. amd64-only smoke test of all four variants + DOCKER_HUB.md sync check. ~30 min. Fires on every push to main. |
Why the split-base pipeline exists
opencode-devbox publishes four image variants (base, omos, with-pi, omos-with-pi) × two architectures (amd64, arm64) = eight image tags per release. CI currently runs on two self-hosted gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).
The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original Dockerfile was a single multi-stage build with INSTALL_* build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated RUN produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.
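The parent-chain effect can be illustrated with a toy model. The sketch below is not BuildKit's actual cache-key algorithm; it only assumes that each layer's key is derived from its parent's key plus its own command string, which is enough to show why one divergent layer invalidates every layer after it. The helper name chain_keys and the sample commands are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model, NOT BuildKit's real algorithm: each layer key is a hash of
# (parent key + command string). chain_keys is a hypothetical helper.
chain_keys() {
  local parent="scratch" cmd
  for cmd in "$@"; do
    parent=$(printf '%s\n%s' "$parent" "$cmd" | sha256sum | cut -c1-12)
    printf '%s  %s\n' "$parent" "$cmd"
  done
}

# Two variants that differ only in the build-arg-gated second step.
chain_keys "install apt packages" "gated RUN, INSTALL_OMOS=false" "install AWS CLI"
echo "---"
chain_keys "install apt packages" "gated RUN, INSTALL_OMOS=true" "install AWS CLI"
```

In this model the first key is the same for both variants, while the second and third keys differ, even though the third command is textually identical. That is the behaviour the split-base design avoids by moving the shared layers into a separately published image.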
Two improvements were considered:
- Reorder the original Dockerfile so all variant-gated RUNs land at the bottom: modest gain, ~10–20% wall-clock reduction. Not pursued.
- Split into Dockerfile.base + Dockerfile.variant, with the base published as a long-lived shared image: significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. Pursued.
The split-base architecture is what the docker-publish-split.yml workflow exercises.
How the split-base pipeline works
                        ┌──────────────────┐
                        │   base-decide    │  compute base-<hash>;
                        │                  │  probe Docker Hub.
                        │ hash inputs:     │  (resolve-versions
                        │  Dockerfile.base │   runs in parallel:
                        │  rootfs/         │   npm view pi/omos
                        │  entrypoint*.sh  │   → concrete versions)
                        └────────┬─────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │    need_build = true?     │
                   └─────────────┬─────────────┘
                             yes │ no
                                 ▼
                        ┌──────────────────┐
                        │    build-base    │  multi-arch build,
                        │                  │  push base-<hash>
                        └────────┬─────────┘  to Docker Hub.
                                 │
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
    ┌──────────┐            ┌──────────┐          ┌──────────────┐
    │smoke-base│            │smoke-omos│   ...    │smoke-omos-pi │  amd64 only,
    └────┬─────┘            └────┬─────┘          └──────┬───────┘  parallel.
         │                       │                       │
         ▼                       ▼                       ▼
    ┌──────────┐            ┌──────────┐          ┌──────────────┐
    │build-    │            │build-    │          │build-        │  multi-arch,
    │variant-  │            │variant-  │   ...    │variant-      │  parallel,
    │base      │            │omos      │          │omos-with-pi  │  tag push.
    └────┬─────┘            └────┬─────┘          └──────┬───────┘
         └───────────────────────┴───────────────────────┘
                                 │
                                 ▼
                        ┌──────────────────────────┐
                        │   promote-base-latest    │  crane copy
                        │                          │  base-<hash>
                        │                          │  → base-latest
                        └────────┬─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────────────┐
                        │    update-description    │
                        └──────────────────────────┘
Step 1: base-decide (and resolve-versions in parallel)
base-decide computes a SHA-256 hash over the inputs that determine
the base image's content:
{
  cat Dockerfile.base
  find rootfs -type f \
    ! -path '*/__pycache__/*' \
    ! -name '*.pyc' \
    ! -name '.DS_Store' \
    ! -name '._*' \
    -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12
Junk filters keep the local recompute reproducible against CI's clean
checkout — __pycache__/*.pyc and macOS metadata files (.DS_Store,
._AppleDouble) are gitignored but still walked by find -type f.
The 12-character truncated hash becomes base-<hash>. Probe Docker Hub
for this tag via docker manifest inspect:
- If it exists → set need_build=false. build-base is skipped entirely.
- If it doesn't → set need_build=true. build-base runs.
This is the core cache-reuse mechanism. Version-bump-only releases
(only Dockerfile.variant or build-args changed) hit the cache. Releases
that change anything in the base — apt packages, AWS CLI, Node version,
locale list, entrypoint scripts — pay the full base-build cost once.
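The hash-then-probe decision can be sketched as two small shell functions. This is a minimal sketch, not the workflow's actual step: the function names compute_base_tag and decide_need_build are hypothetical, and the probe command is passed in as arguments (defaulting to docker manifest inspect) so the logic can be exercised without registry access. The hash pipeline mirrors the one shown earlier in this step.

```shell
#!/usr/bin/env bash
# Sketch of the base-decide logic. Function names and the probe hook are
# hypothetical; only the hash pipeline is taken from this document.

# Print base-<hash> for the repository checkout at $1 (default: current dir).
compute_base_tag() {
  (
    cd "${1:-.}" || exit 1
    {
      cat Dockerfile.base
      find rootfs -type f \
        ! -path '*/__pycache__/*' \
        ! -name '*.pyc' \
        ! -name '.DS_Store' \
        ! -name '._*' \
        -print0 | sort -z | xargs -0 cat
      cat entrypoint.sh entrypoint-user.sh
    } | sha256sum | cut -c1-12 | sed 's/^/base-/'
  )
}

# Print need_build=true or need_build=false.
# $1 is the full image ref; any remaining arguments form the probe command.
decide_need_build() {
  local ref="$1"
  shift
  if [ "$#" -eq 0 ]; then
    set -- docker manifest inspect
  fi
  if "$@" "$ref" >/dev/null 2>&1; then
    echo "need_build=false"   # tag already published: skip build-base
  else
    echo "need_build=true"    # tag missing (or probe failed): build it
  fi
}

# Possible use inside a workflow step (assumes GITHUB_OUTPUT is set):
#   tag=$(compute_base_tag .)
#   decide_need_build "joakimp/opencode-devbox:${tag}" >> "$GITHUB_OUTPUT"
```

Note that in this sketch any probe failure, including a transient registry error, is treated as "tag missing" and leads to a rebuild. That errs on the side of doing extra work rather than skipping a needed build.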
resolve-versions runs alongside base-decide (no needs:
dependency between them) and resolves the floating npm packages whose
*_VERSION build-args default to latest:
PI_VERSION=$(npm view @earendil-works/pi-coding-agent version)
OMOS_VERSION=$(npm view oh-my-opencode-slim version)
The outputs (pi_version, omos_version) are consumed by every variant
smoke and build job that installs pi or omos. Why this exists: without
it, the npm install -g RUN layer in Dockerfile.variant hashes
identically across builds (same ARG default, same command string), so
the registry buildcache silently reuses the layer from whatever upstream
version was current when the cache was first populated. This is the
cache-hit silent-regression class of bug that shipped pi-devbox v0.74.0
through v0.75.5 with identical image bytes (fixed in pi-devbox v0.75.5b
2026-05-23). Currently masked here by OPENCODE_VERSION bumping every
release (parent-chain cache-key invalidation), but masking would fail on
a vN.N.Nb opencode-version-unchanged release that only bumps pi or
omos. Smoke jobs additionally assert EXPECTED_PI_VERSION /
EXPECTED_OMOS_VERSION against the resolved values.
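The version assertion can be pictured as a plain string comparison between the value resolved by resolve-versions and the value the built image reports. The helper below is a hypothetical sketch; the real check lives in scripts/smoke-test.sh and may be structured differently. How the installed version is read from the image is deliberately left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the actual smoke-test.sh code.
# Usage: assert_version <name> <expected> <actual>
assert_version() {
  local name="$1" expected="$2" actual="$3"
  if [ -z "$expected" ]; then
    # Variant does not install this package, so nothing to assert.
    echo "SKIP ${name}: no expected version set"
    return 0
  fi
  if [ "$expected" = "$actual" ]; then
    echo "OK ${name}: ${actual}"
  else
    echo "FAIL ${name}: expected ${expected}, got ${actual}" >&2
    return 1
  fi
}

# Example wiring (assumption): the resolved value is passed both as a
# build-arg and as the expectation, so a stale cached layer is caught.
#   docker buildx build --build-arg PI_VERSION="$PI_VERSION" ...
#   assert_version pi "$EXPECTED_PI_VERSION" "$installed_pi_version"
```

Passing the concrete version as a build-arg changes the command string of the install layer, which is what defeats the silent cache reuse described above; the assertion is the second line of defence.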
Step 2: build-base (conditional)
Only runs when need_build=true. Multi-arch (amd64 + arm64) build of
Dockerfile.base, pushed to joakimp/opencode-devbox:base-<hash>.
Registry cache via --cache-from/--cache-to reduces incremental rebuilds
when only one or two layers changed.
The base image is not tagged base-latest here — that promotion
happens at the very end after all variants succeed (see step 5).
Step 3: smoke-* (×4, parallel)
For each variant: build amd64-only against the base tag, load into
local docker, run scripts/smoke-test.sh.
Variant build-args:
| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
|---|---|---|---|
| base | true | false | false |
| omos | true | true | false |
| with-pi | true | false | true |
| omos-with-pi | true | true | true |
The smoke script is invoked with --variant <name> to enable variant-specific assertions.
Smoke gates the publish: a smoke failure for variant X blocks build-variant-X.
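The variant table maps directly onto build-arg flags. The helper below is a hypothetical sketch of that mapping (the function name variant_build_args is not from the workflow); it only encodes the INSTALL_* values listed in the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --build-arg flags for one variant name.
variant_build_args() {
  local omos=false pi=false
  case "$1" in
    base) ;;
    omos) omos=true ;;
    with-pi) pi=true ;;
    omos-with-pi) omos=true; pi=true ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  printf -- '--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=%s --build-arg INSTALL_PI=%s\n' \
    "$omos" "$pi"
}

variant_build_args omos-with-pi
```

Rejecting unknown names keeps a typo in a job definition from silently building the base variant under the wrong tag.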
Step 4: build-variant-* (×4, parallel)
For each variant that passed smoke: multi-arch (amd64 + arm64) build of
Dockerfile.variant, pushed to Docker Hub with the user-facing release
tags:
| Build job | Tags pushed |
|---|---|
| build-variant-base | vX.Y.Z, latest |
| build-variant-omos | vX.Y.Z-omos, latest-omos |
| build-variant-with-pi | vX.Y.Z-with-pi, latest-with-pi |
| build-variant-omos-with-pi | vX.Y.Z-omos-with-pi, latest-omos-with-pi |
The latest* aliases are only updated when promote_latest=true (the
manual dispatch input) — for test runs, promote_latest=false keeps the
production aliases pointing at the previous good release.
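The tag naming rule can be written down as a short function. This is a sketch under the assumption that the tag suffix is simply the variant name (empty for base); release_tags is a hypothetical helper, not something defined in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one tag per line for a release.
# Usage: release_tags <version> <variant> [promote_latest]
release_tags() {
  local version="$1" variant="$2" promote_latest="${3:-false}" suffix=""
  if [ "$variant" != "base" ]; then
    suffix="-${variant}"
  fi
  echo "${version}${suffix}"
  # The latest* alias is only emitted when promotion is requested.
  if [ "$promote_latest" = "true" ]; then
    echo "latest${suffix}"
  fi
}

release_tags v1.14.50 omos true
```

With promote_latest=false only the versioned tag is printed, which matches the test-run behaviour described above.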
Step 5: promote-base-latest
Once all four variants successfully publish, re-tag base-<hash> as
base-latest using crane copy. This is a manifest-level re-tag, not
a rebuild — it touches only Docker Hub's image index, takes seconds,
and is atomic.
The reason this happens after variants succeed (rather than alongside
build-base) is so a partial failure leaves base-latest pointing at
the previous known-good base. External consumers who pin to
base-latest (e.g. the planned pi-devbox repo) never see a broken base.
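As a rough sketch, the promotion amounts to a single crane copy call. The wrapper name promote_base_latest and the CRANE override variable are assumptions added here so the command can be previewed without touching the registry; they are not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `crane copy SRC DST`.
# Set CRANE=echo to print the arguments instead of running crane.
promote_base_latest() {
  local repo="$1" hash="$2"
  "${CRANE:-crane}" copy "${repo}:base-${hash}" "${repo}:base-latest"
}

# Dry run: prints the copy arguments only.
CRANE=echo promote_base_latest joakimp/opencode-devbox 0123456789ab
```

Because the copy works on an already-pushed tag, it only makes sense as the last step: if any variant job fails, this function is never reached and base-latest keeps its previous target.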
Step 6: update-description
Push the generated DOCKER_HUB.md to the Hub repo's full_description
field via the Hub REST API. Same step as the production pipeline.
NPM_CONFIG_PREFIX gotcha (variant override pattern)
The base sets
ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global
This is intentional — it makes pi install npm:<pkg> and npm install -g
land on the devbox-pi-config named volume at runtime, so user-installed
packages survive container recreate AND image rebuild.
But the variant build inherits this prefix at build time. If left as-is,
npm install -g opencode-ai@$VERSION in Dockerfile.variant would
install opencode into /home/developer/.pi/npm-global/..., which is then
shadowed by the volume mount at runtime → opencode disappears from
PATH on first start.
Fix: each npm install -g in Dockerfile.variant overrides the prefix
per-RUN:
RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}
Baked binaries land on /usr/bin/... (system prefix), survive the volume
mount. Runtime-installed user packages still land on
~/.pi/npm-global/.... Both visible on PATH.
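The per-RUN override relies on ordinary shell semantics: a VAR=value prefix applies to that one command and leaves the surrounding environment untouched. The sketch below demonstrates only that scoping; npm is not invoked, and a child shell stands in for the npm install -g process.

```shell
#!/usr/bin/env bash
# Demonstrates the scoping only. The child `sh` stands in for npm and
# prints the prefix it would see in its environment.
export NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global   # as set by the base
show_prefix='printf "%s\n" "$NPM_CONFIG_PREFIX"'

inherited=$(sh -c "$show_prefix")
overridden=$(NPM_CONFIG_PREFIX=/usr sh -c "$show_prefix")
afterwards=$(sh -c "$show_prefix")

echo "inherited:  $inherited"
echo "overridden: $overridden"
echo "afterwards: $afterwards"
```

The middle command sees /usr, while the commands before and after it still see the volume-backed prefix, which is why the override has to be repeated on every npm install -g in Dockerfile.variant rather than set once.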
Cache strategy
Two registry caches are configured:
cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max
cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max
mode=max exports cache for all layers, not just the final image's
layers. Important for multi-arch builds where the cross-arch layer reuse
matters more.
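For a local reproduction with docker buildx build, the same settings translate into --cache-from and --cache-to flags. The helper below is a hypothetical convenience (cache_flags is not defined anywhere in the repo); it just formats the two flags for a given cache ref.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print buildx cache flags for one registry cache ref.
cache_flags() {
  local ref="$1"
  printf -- '--cache-from type=registry,ref=%s --cache-to type=registry,ref=%s,mode=max\n' \
    "$ref" "$ref"
}

cache_flags joakimp/opencode-devbox:base-buildcache

# Possible use (word splitting of the output is intentional here):
#   docker buildx build $(cache_flags joakimp/opencode-devbox:base-buildcache) \
#     -f Dockerfile.base .
```

Exporting with --cache-to needs push access to the cache ref, so a read-only local run would typically pass only the --cache-from half.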
Wall-clock estimates
| Scenario | Production pipeline | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | ~30–40 min (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | ~70–90 min (base rebuilds) |
The split-base pipeline pays its dues on base-touching releases (which are infrequent — a few times a year for Debian / Node major version bumps). Most releases are version-bumps and ride the cache.
Validate workflow
validate.yml is the lightweight gate that runs
on every push to main and on PRs. It:
- Runs scripts/generate-dockerhub-md.py --check to enforce that DOCKER_HUB.md is in sync with HUB_TEMPLATE.
- Builds each of the four variants amd64-only (no multi-arch, no push) and runs scripts/smoke-test.sh.
This catches regressions before they reach a tag push. Wall clock ~30 min.
Runner expectations
- Image: catthehacker/ubuntu:act-latest. Each job runs inside a fresh container of this image. Don't assume any pre-installed toolchains beyond what catthehacker ships.
- Disk pressure: the runner host has ~40 GB of usable overlay space, often 70%+ used at job start. Every job that does load: true (smoke) starts with a Reclaim runner disk step that strips catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, Boost, Chromium, PowerShell) and prunes stale docker state. Don't remove these steps without testing on a fresh runner.
- Concurrency: 2 runners. Jobs in the same workflow run can fan out to both; jobs in different workflow runs are serialized by gitea's queue. The concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false } setting keeps tag pushes from racing each other but allows per-PR/per-branch parallelism.
- Workflow visibility in UI: gitea Actions only surfaces workflows from the default branch in the web UI's workflow list, even for workflow_dispatch triggers. Workflows on feature branches are invisible until merged to main.
- Artifact actions quirk: actions/{upload,download}-artifact@v4+ does not work on Gitea (it depends on a GitHub-only Artifact API). Stick to @v3 if matrix-fanout-with-artifacts is ever needed. We avoided this by using docker/build-push-action@v7 with comma-separated platforms: linux/amd64,linux/arm64, which natively does a multi-arch push in a single job, with no artifact dance.
Migration plan: split-base → production
- Validate the split-base dispatch. Trigger docker-publish-split.yml manually with release_tag=v0.0.0-split-test and promote_latest=false. Confirm all jobs go green, image sizes match the production baseline within ~10%, and no unexpected layer rebuilds appear in build-variant-* logs after the FROM line.
- Run a second dispatch to confirm cache-hit behavior: base-decide should set need_build=false, build-base should be skipped entirely, total wall clock should drop to ~25–40 min.
- Cut over: done as of v1.14.50. docker-publish-split.yml now triggers on push: tags: v*. docker-publish.yml and the original Dockerfile have been deleted.
- Tag a release. First production release on the new pipeline.
Related docs
- AGENTS.md: domain facts, release-day checklist, documentation coupling rules. Read first when modifying CI behavior.
- CHANGELOG.md: build pipeline rewrite landed in v1.14.50.
- Dockerfile.base, Dockerfile.variant: the split-base Dockerfiles. Comments at the top of each explain their role.
- scripts/smoke-test.sh: invoked by both workflows; this is the single source of truth for "what does a built image have to satisfy".
- scripts/generate-dockerhub-md.py: generates DOCKER_HUB.md from HUB_TEMPLATE. --check enforces sync in validate.yml.