pi e963f83e70
ci: CI-resolve mempalace-toolkit to a pinned SHA
mempalace-toolkit is the only dependency cloned in Dockerfile.base (all
others live in the variant), so it bypassed the resolve-versions ->
build-arg plumbing and its ref stayed a literal `main`. Because the base
only rebuilds on a content hash, a toolkit-only fix would silently fail to
land unless Dockerfile.base itself changed.

Mirrors pi-devbox commit 4744f05, adapted to this repo:
- resolve-versions: new mempalace_toolkit_ref output via the gitea commits
  API (first gitea call in this repo's CI; works unauthenticated, no secret).
- base-decide: needs resolve-versions; fold the SHA into the base-tag hash
  so a moved toolkit forces a base rebuild (they no longer run in parallel).
- build-base: needs resolve-versions; pass --build-arg MEMPALACE_TOOLKIT_REF.
- Dockerfile.base: clone switched to SHA-capable git fetch + checkout
  FETCH_HEAD (git clone --branch <SHA> would fail).
- docs lockstep: .gitea/README.md Step 1 (no longer "in parallel"), AGENTS.md
  Critical conventions, CHANGELOG Unreleased.

base_tag now reflects a live gitea lookup; on API blip it falls back to
`main`, triggering one extra rebuild, never a missed one. No new tag —
lands on the next release or workflow_dispatch.
2026-06-14 15:51:55 +02:00


CI / Build Pipeline

This directory contains the gitea Actions workflows and the supporting documentation for opencode-devbox's CI. If you're investigating why the build pipeline is shaped the way it is, you're in the right place.

Workflows in this directory

| File | Trigger | Role |
|---|---|---|
| workflows/docker-publish-split.yml | push: tags: v* | Production release pipeline. Two-phase split-base build: shared base-<hash> published once (skipped on cache hit), then two parallel variant deltas. ~40–80 min wall clock depending on runner count and whether base needs rebuilding. |
| workflows/validate.yml | push: branches: main + PR | Lightweight gate. amd64-only smoke test of both variants + DOCKER_HUB.md sync check. ~30 min. Fires on every push to main. |

Why the split-base pipeline exists

opencode-devbox builds two image variants (base, omos) × two architectures (amd64, arm64), publishing four tags per release + the floating base-latest. CI currently runs on two self-hosted gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

pi was removed in v2.0.0; it now builds in its own joakimp/pi-devbox repo. Before v2.0.0 a fifth pi-only build was produced here and pushed into that repo as base-pi-only — that coupling is gone.

The two variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original Dockerfile was a single multi-stage build with INSTALL_* build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated RUN produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.
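The parent-chain effect described above can be illustrated with a toy model. This is only a sketch: it is not BuildKit's actual cache-key algorithm, and the step names are hypothetical. Each layer key hashes the parent key together with the command, so one differing step changes every key after it.

```shell
# Toy model of a parent-chained layer cache key (NOT BuildKit's real
# algorithm): each key hashes the parent key together with the command.
layer_key() {
  printf '%s\n%s\n' "$1" "$2" | sha256sum | cut -c1-12
}

# chain_keys CMD... -> prints one key per layer, in order.
chain_keys() {
  parent="root"
  for cmd in "$@"; do
    parent=$(layer_key "$parent" "$cmd")
    printf '%s\n' "$parent"
  done
}

# Hypothetical variants: identical steps except a build-arg-gated one
# in the middle. The first key matches; the second and third do not,
# even though the third command string is the same in both chains.
chain_keys "apt-get install" "INSTALL_OMOS=false" "install dev tools"
chain_keys "apt-get install" "INSTALL_OMOS=true"  "install dev tools"
```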

Two improvements were considered:

  1. Reorder the original Dockerfile so all variant-gated RUNs land at the bottom: modest gain, ~10–20% wall-clock reduction. Not pursued.
  2. Split into Dockerfile.base + Dockerfile.variant with the base published as a long-lived shared image: significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. Pursued.

The split-base architecture is what the docker-publish-split.yml workflow exercises.

How the split-base pipeline works

                       ┌──────────────────┐
                       │  base-decide     │   compute base-<hash>;
                       │                  │   probe Docker Hub.
                       │  hash inputs:    │   (resolve-versions
                       │   Dockerfile.base│   runs first:
                       │   rootfs/        │   npm view omos
                       │   entrypoint*.sh │   → concrete version;
                       │   toolkit SHA    │   toolkit main → SHA)
                       └────────┬─────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │ need_build = true?        │
                  └─────────────┬─────────────┘
                       yes      │       no
                                ▼
                       ┌──────────────────┐
                       │  build-base      │   multi-arch build,
                       │                  │   push base-<hash>
                       └────────┬─────────┘   to Docker Hub.
                                │
        ┌───────────────────────┤
        ▼                       ▼
   ┌──────────┐            ┌──────────┐
   │smoke-base│            │smoke-omos│             amd64 only,
   └────┬─────┘            └────┬─────┘             parallel.
        │                       │
        ▼                       ▼
   ┌──────────┐            ┌──────────┐
   │build-    │            │build-    │             multi-arch,
   │variant-  │            │variant-  │             parallel,
   │base      │            │omos      │             tag push.
   └────┬─────┘            └────┬─────┘
        └───────────────────────┤
                                │
                                ▼
                  ┌──────────────────────────┐
                  │  promote-base-latest     │   crane copy
                  │                          │   base-<hash>
                  │                          │   → base-latest
                  └────────┬─────────────────┘
                           │
                           ▼
                  ┌──────────────────────────┐
                  │  update-description      │
                  └──────────────────────────┘

Step 1: resolve-versions, then base-decide

resolve-versions resolves floating refs to concrete values: omos_version (npm latest) and mempalace_toolkit_ref (the mempalace-toolkit main HEAD resolved to a commit SHA via the gitea commits API). base-decide now depends on resolve-versions (they no longer run in parallel) because it folds mempalace_toolkit_ref into the base hash — see below.
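A minimal sketch of the ref resolution, under stated assumptions: the URL passed in is a placeholder for a gitea commits endpoint that returns a JSON array of commits (newest first) with a `sha` field, jq is available on the runner, and the function names are hypothetical. The fallback to `main` mirrors the behaviour described in the commit message; the real workflow step may differ.

```shell
# is_full_sha REF -> succeeds only for a 40-character lowercase hex SHA.
is_full_sha() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# resolve_toolkit_ref API_URL -> prints a commit SHA, or "main" on any
# failure (network error, missing jq, unexpected response shape).
resolve_toolkit_ref() {
  sha=$(curl -fsS --max-time 10 "$1" 2>/dev/null \
        | jq -r '.[0].sha' 2>/dev/null) || sha=""
  if is_full_sha "$sha"; then
    printf '%s\n' "$sha"
  else
    printf 'main\n'
  fi
}
```

Validating the shape of the result before using it means a malformed response degrades to the `main` fallback rather than feeding garbage into the base hash.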

base-decide computes a SHA-256 hash over the inputs that determine the base image's content:

{
  cat Dockerfile.base
  find rootfs -type f \
    ! -path '*/__pycache__/*' \
    ! -name '*.pyc' \
    ! -name '.DS_Store' \
    ! -name '._*' \
    -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
  echo "$mempalace_toolkit_ref"   # CI-resolved SHA; mempalace-toolkit is
                                  # cloned in Dockerfile.base, so a moved
                                  # toolkit must force a base rebuild
} | sha256sum | cut -c1-12

Junk filters keep the local recompute reproducible against CI's clean checkout — __pycache__/*.pyc and macOS metadata files (.DS_Store, ._AppleDouble) are gitignored but still walked by find -type f.
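The effect of the junk filters can be checked with a throwaway directory. The sketch below is an illustration only: `hash_tree` is a hypothetical helper covering just the `find` portion of the hash, the file names are made up, and `LC_ALL=C` is an addition (not in the pipeline snippet above) to pin the sort order across locales.

```shell
# hash_tree DIR -> 12-char hash over file contents, ignoring junk files.
hash_tree() {
  ( cd "$1" && find . -type f \
      ! -path '*/__pycache__/*' \
      ! -name '*.pyc' \
      ! -name '.DS_Store' \
      ! -name '._*' \
      -print0 | LC_ALL=C sort -z | xargs -0 cat ) | sha256sum | cut -c1-12
}

demo=$(mktemp -d)
mkdir -p "$demo/etc" "$demo/lib/__pycache__"
echo "setting=1" > "$demo/etc/app.conf"
clean=$(hash_tree "$demo")

# Simulate local junk that a clean CI checkout would never contain.
touch "$demo/.DS_Store" "$demo/etc/._app.conf"
echo "bytecode" > "$demo/lib/__pycache__/mod.cpython-312.pyc"
dirty=$(hash_tree "$demo")

[ "$clean" = "$dirty" ] && echo "hash unchanged by junk files: $clean"
rm -rf "$demo"
```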

The 12-character truncated hash becomes base-<hash>. Probe Docker Hub for this tag via docker manifest inspect:

  • If it exists → set need_build=false. build-base is skipped entirely.
  • If it doesn't → set need_build=true. build-base runs.

This is the core cache-reuse mechanism. Version-bump-only releases (only Dockerfile.variant or build-args changed) hit the cache. Releases that change anything in the base — apt packages, AWS CLI, Node version, locale list, entrypoint scripts — pay the full base-build cost once.
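The probe-and-decide step can be sketched roughly as follows. `probe_base_tag` is a hypothetical name, and the real job may wire the result into the workflow differently; the only call relied on is `docker manifest inspect`, which exits non-zero when the tag cannot be found.

```shell
# probe_base_tag IMAGE_REF -> prints need_build=false if the tag exists
# in the registry, need_build=true otherwise.
probe_base_tag() {
  if docker manifest inspect "$1" >/dev/null 2>&1; then
    echo "need_build=false"
  else
    echo "need_build=true"
  fi
}

# Example call (the hash value here is a placeholder):
# probe_base_tag "joakimp/opencode-devbox:base-0123456789ab"
```

Note that in this sketch any probe failure, including a registry outage, resolves to need_build=true, so the failure mode is an unnecessary rebuild rather than a skipped one.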

resolve-versions (which, per the dependency described above, now runs before base-decide rather than alongside it) also resolves the floating npm packages whose *_VERSION build-args default to latest:

OMOS_VERSION=$(npm view oh-my-opencode-slim version)

The output (omos_version) is consumed by the omos variant smoke and build jobs. Why this exists: without it, the npm install -g RUN layer in Dockerfile.variant hashes identically across builds (same ARG default, same command string), so the registry buildcache silently reuses the layer from whatever upstream version was current when the cache was first populated. This is the cache-hit silent-regression class of bug that shipped pi-devbox v0.74.0 through v0.75.5 with identical image bytes (fixed in pi-devbox v0.75.5b 2026-05-23). Currently masked here by OPENCODE_VERSION bumping every release (parent-chain cache-key invalidation), but masking would fail on a vN.N.Nb opencode-version-unchanged release that only bumps omos. Smoke jobs additionally assert EXPECTED_OMOS_VERSION against the resolved value.
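The smoke-side assertion amounts to a strict string comparison between the CI-resolved version and whatever the built image reports. A minimal sketch follows; the function name is hypothetical, and how scripts/smoke-test.sh actually obtains the installed version is not shown here.

```shell
# assert_omos_version EXPECTED ACTUAL -> fails loudly on a mismatch, so a
# stale cached layer cannot pass the smoke test unnoticed.
assert_omos_version() {
  if [ "$1" != "$2" ]; then
    echo "omos version mismatch: expected $1, image has $2" >&2
    return 1
  fi
  echo "omos version OK: $2"
}

# Usage sketch: EXPECTED_OMOS_VERSION comes from resolve-versions,
# the second argument from the image under test.
# assert_omos_version "$EXPECTED_OMOS_VERSION" "$installed_version"
```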

Step 2: build-base (conditional)

Only runs when need_build=true. Multi-arch (amd64 + arm64) build of Dockerfile.base, pushed to joakimp/opencode-devbox:base-<hash>. Registry cache via --cache-from/--cache-to reduces incremental rebuilds when only one or two layers changed.

The base image is not tagged base-latest here — that promotion happens at the very end after all variants succeed (see step 5).

Step 3: smoke-* (×2, parallel)

For each variant: build amd64-only against the base tag, load into local docker, run scripts/smoke-test.sh. Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS |
|---|---|---|
| base | true | false |
| omos | true | true |

Smoke runs --variant <name> to enable variant-specific assertions. Gate the publish: a smoke failure for variant X blocks build-variant-X.
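The variant-to-build-arg mapping above can be expressed as a small lookup. This is a sketch with a hypothetical helper name; the workflow itself may set these values inline per job.

```shell
# variant_build_args VARIANT -> prints the build-arg flags for that variant.
variant_build_args() {
  case "$1" in
    base) echo "--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=false" ;;
    omos) echo "--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=true" ;;
    *)    echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

variant_build_args omos
```

Rejecting unknown variant names keeps a typo from silently building with default build-args.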

Step 4: build-variant-* (×2, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of Dockerfile.variant, pushed to Docker Hub with the user-facing release tags:

| Build job | Tags pushed |
|---|---|
| build-variant-base | vX.Y.Z, latest |
| build-variant-omos | vX.Y.Z-omos, latest-omos |

The latest* aliases are only updated when promote_latest=true (the manual dispatch input) — for test runs, promote_latest=false keeps the production aliases pointing at the previous good release.
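The tag selection described above can be sketched as a function of variant, release tag, and the promote_latest input. The helper name and the example version are hypothetical.

```shell
# release_tags VARIANT RELEASE_TAG PROMOTE_LATEST -> one tag per line.
release_tags() {
  case "$1" in
    base) suffix="" ;;
    omos) suffix="-omos" ;;
    *)    echo "unknown variant: $1" >&2; return 1 ;;
  esac
  echo "$2$suffix"
  # The floating alias is only emitted when promotion is requested.
  if [ "$3" = "true" ]; then
    echo "latest$suffix"
  fi
}

release_tags omos v2.0.1 true    # release tag plus the latest-omos alias
release_tags base v2.0.1 false   # release tag only
```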

Step 5: promote-base-latest

Once both variants successfully publish, re-tag base-<hash> as base-latest using crane copy. This is a manifest-level re-tag, not a rebuild — it touches only Docker Hub's image index, takes seconds, and is atomic.

The reason this happens after variants succeed (rather than alongside build-base) is so a partial failure leaves base-latest pointing at the previous known-good base. External consumers who pin to base-latest never see a broken base.
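As a rough sketch, the promotion is a single registry-side copy. The wrapper name below is hypothetical and the hash is a placeholder; registry authentication is assumed to have been set up earlier in the job.

```shell
# promote_base_latest REPO HASH -> re-tag base-<hash> as base-latest
# without pulling or rebuilding any layers.
promote_base_latest() {
  crane copy "$1:base-$2" "$1:base-latest"
}

# Example call:
# promote_base_latest joakimp/opencode-devbox 0123456789ab
```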

Step 6: update-description

Push the generated DOCKER_HUB.md to the Hub repo's full_description field via the Hub REST API. Same step as the production pipeline.

NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

ENV NPM_CONFIG_PREFIX=/home/developer/.config/opencode/npm-global

This is intentional — it makes npm install -g land on the devbox-opencode-config named volume at runtime, so user-installed packages survive container recreate AND image rebuild. (Before v2.0.0 this prefix lived at ~/.pi/npm-global on the now-removed devbox-pi-config volume; entrypoint-user.sh migrates the old path once.)

But the variant build inherits this prefix at build time. If left as-is, npm install -g opencode-ai@$VERSION in Dockerfile.variant would install opencode into /home/developer/.config/opencode/npm-global/..., which is then shadowed by the volume mount at runtime → opencode disappears from PATH on first start.

Fix: each npm install -g in Dockerfile.variant overrides the prefix per-RUN:

RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}

Baked binaries land on /usr/bin/... (system prefix), survive the volume mount. Runtime-installed user packages still land on ~/.config/opencode/npm-global/.... Both visible on PATH.
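The per-RUN override relies on ordinary shell semantics: a `VAR=value command` prefix changes the environment for that one command only. The sketch below demonstrates this outside Docker, with `export` standing in for the ENV line inherited from the base image; the `show` helper string is purely illustrative.

```shell
# Stand-in for the ENV line inherited from the base image.
export NPM_CONFIG_PREFIX=/home/developer/.config/opencode/npm-global

# Command string that prints the prefix a child process sees.
show='printf "%s\n" "$NPM_CONFIG_PREFIX"'

sh -c "$show"                          # prints the inherited prefix
NPM_CONFIG_PREFIX=/usr sh -c "$show"   # prints /usr for this one command
sh -c "$show"                          # prints the inherited prefix again
```

Because the override does not persist, runtime `npm install -g` calls still see the volume-backed prefix set by the base image.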

Cache strategy

Two registry caches are configured:

cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max

mode=max exports cache for all layers, not just the final image's layers. Important for multi-arch builds where the cross-arch layer reuse matters more.
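For local experiments, the same cache settings can be passed to the buildx CLI. The helper below is a hypothetical convenience that assembles the flags for one cache ref; the workflow itself configures these through the build action's cache-from/cache-to inputs shown above.

```shell
# cache_flags REPO CACHE_TAG -> buildx flags that read from and write to
# one registry cache, exporting all intermediate layers (mode=max).
cache_flags() {
  ref="$1:$2"
  echo "--cache-from type=registry,ref=$ref --cache-to type=registry,ref=$ref,mode=max"
}

cache_flags joakimp/opencode-devbox base-buildcache

# Usage sketch (word splitting of the flags is intentional here):
# docker buildx build $(cache_flags joakimp/opencode-devbox base-buildcache) \
#   -f Dockerfile.base .
```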

Wall-clock estimates

| Scenario | Production pipeline | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/omos version changed) | ~165–180 min | ~30–40 min (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | ~70–90 min (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases (which are infrequent — a few times a year for Debian / Node major version bumps). Most releases are version-bumps and ride the cache.

Validate workflow

validate.yml is the lightweight gate that runs on every push to main and on PRs. It:

  1. Runs scripts/generate-dockerhub-md.py --check to enforce DOCKER_HUB.md is in sync with HUB_TEMPLATE.
  2. Builds each of the two variants amd64-only (no multi-arch, no push) and runs scripts/smoke-test.sh.

This catches regressions before they reach a tag push. Wall clock ~30 min.

Runner expectations

  • Image: catthehacker/ubuntu:act-latest. Each job runs inside a fresh container of this image. Don't assume any pre-installed toolchains beyond what catthehacker ships.
  • Disk pressure: the runner host has ~40 GB of usable overlay space, often 70%+ used at job start. Every job that does load: true (smoke) starts with a Reclaim runner disk step that strips catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, Boost, Chromium, PowerShell) and prunes stale docker state. Don't remove these steps without testing on a fresh runner.
  • Concurrency: 2 runners. Jobs in the same workflow run can fan out to both; jobs in different workflow runs are serialized by gitea's queue. The concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false } setting keeps tag pushes from racing each other but allows per-PR/per-branch parallelism.
  • Workflow visibility in UI: gitea Actions only surfaces workflows from the default branch in the web UI's workflow list, even for workflow_dispatch triggers. Workflows on feature branches are invisible until merged to main.
  • Artifact actions quirk: actions/{upload,download}-artifact@v4+ does not work on Gitea (it depends on a GitHub-only Artifact API). Stick to @v3 if matrix-fanout-with-artifacts is ever needed. We avoided this by using docker/build-push-action@v7 with comma-separated platforms: linux/amd64,linux/arm64, which natively does a multi-arch push in a single job with no artifact dance.

Migration plan: split-base → production

  1. Validate the split-base dispatch. Trigger docker-publish-split.yml manually with release_tag=v0.0.0-split-test and promote_latest=false. Confirm all jobs go green, image sizes match the production baseline within ~10%, and no unexpected layer rebuilds appear in build-variant-* logs after the FROM line.
  2. Run a second dispatch to confirm cache-hit behavior: base-decide should set need_build=false, build-base should be skipped entirely, total wall clock should drop to ~25–40 min.
  3. Cut over: done as of v1.14.50. docker-publish-split.yml now triggers on push: tags: v*. docker-publish.yml and the original Dockerfile have been deleted.
  4. Tag a release. First production release on the new pipeline.

See also

  • AGENTS.md: domain facts, release-day checklist, documentation coupling rules. Read first when modifying CI behavior.
  • CHANGELOG.md: build pipeline rewrite landed in v1.14.50.
  • Dockerfile.base, Dockerfile.variant: the split-base Dockerfiles. Comments at the top of each explain their role.
  • scripts/smoke-test.sh: invoked by both workflows; this is the single source of truth for "what does a built image have to satisfy".
  • scripts/generate-dockerhub-md.py: generates DOCKER_HUB.md from HUB_TEMPLATE. --check enforces sync in validate.yml.