opencode-devbox/.gitea/README.md
joakimp 6fde27c212
Document the build pipeline architecture in .gitea/README.md
The split-base build architecture, the NPM_CONFIG_PREFIX gotcha, the
hash-driven base cache reuse mechanism, and the cutover plan from
docker-publish.yml to docker-publish-split.yml were previously
scattered across:
  - inline Dockerfile.base / Dockerfile.variant comments
  - CHANGELOG Unreleased entries
  - AGENTS.md mentions
  - docker-publish-split.yml header comment
  - my own session notes

Consolidate into .gitea/README.md as the canonical architectural doc.
Gitea (like GitHub) auto-renders this when navigating to .gitea/ in
the web UI, so anyone investigating 'why is CI shaped this way?'
finds it on the first click. Cross-referenced from AGENTS.md as the
first thing to read when touching CI.

Covers:
  - The two release pipelines and why both exist
  - Why split-base: cross-variant cache misses on layer-hash-divergence
  - The 6 phases of the split-base pipeline with an ASCII diagram
  - base-decide hash inputs and Docker Hub probe logic
  - NPM_CONFIG_PREFIX variant-override pattern (the volume-shadow trap)
  - Registry cache strategy (mode=max for cross-arch reuse)
  - Wall-clock estimates: version-bump vs base-touching releases
  - Validate workflow role
  - Runner expectations: catthehacker image, disk reclaim, concurrency,
    Gitea Actions @v4 artifact incompatibility
  - 4-step migration plan from docker-publish.yml to .split.yml
  - Cross-refs to related docs

Does not duplicate AGENTS.md content; links to it for domain facts and
release-day checklist.
2026-05-09 19:28:03 +02:00


CI / Build Pipeline

This directory contains the Gitea Actions workflows and the supporting documentation for opencode-devbox's CI. If you're investigating why the build pipeline is shaped the way it is, you're in the right place.

Workflows in this directory

| File | Trigger | Role |
| --- | --- | --- |
| workflows/docker-publish.yml | push: tags: v* | Production release pipeline. Multi-arch build of all four variants (base, omos, with-pi, omos-with-pi), publish to Docker Hub, update Hub description. ~165-180 min wall clock. |
| workflows/docker-publish-split.yml | workflow_dispatch (manual) | Experimental split-base pipeline. Two-phase build: shared base-<hash> published once, then four thin variant deltas. Estimated ~30-40 min on cache hit, ~70-90 min when base needs rebuilding. Not yet validated end-to-end; once 1-2 dispatch test runs prove it, this will take over on: push: tags: v* and docker-publish.yml will be retired. |
| workflows/validate.yml | push: branches: main + PR | Lightweight gate. amd64-only smoke test of all four variants + DOCKER_HUB.md sync check. ~30 min. Fires on every push to main. |

Why two release pipelines exist

opencode-devbox publishes four image variants (base, omos, with-pi, omos-with-pi) × two architectures (amd64, arm64) = eight image tags per release. The builds currently run on two self-hosted Gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3-5x slower than native).

The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original Dockerfile was a single multi-stage build with INSTALL_* build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated RUN produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.
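
A toy model may make the cascade concrete. The sketch below is not BuildKit's actual cache-key algorithm; it assumes only that each layer's key depends on its parent's key plus its own instruction (with build-arg values folded in), which is enough to show why one divergent layer invalidates everything after it. The function name layer_keys and the sample instructions are hypothetical.

```shell
# Toy model (an assumption, not BuildKit's real algorithm): a layer's key is
# the hash of its parent's key plus its own instruction text.
layer_keys() {
  local parent="scratch" instr
  for instr in "$@"; do
    parent=$(printf '%s\n%s' "$parent" "$instr" | sha256sum | cut -c1-12)
    echo "$parent  $instr"
  done
}

# Two hypothetical variants that differ only in the second instruction:
# layer_keys "FROM debian" "RUN setup INSTALL_OMOS=false" "RUN apt-get install -y git"
# layer_keys "FROM debian" "RUN setup INSTALL_OMOS=true"  "RUN apt-get install -y git"
```

In this model the first key matches across both calls, while the second and third keys differ, even though the third instruction is textually identical. That is the "different parent" effect described above.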

Two improvements were considered:

  1. Reorder the original Dockerfile so all variant-gated RUNs land at the bottom: modest gain, ~10-20% wall-clock reduction. Not pursued.
  2. Split into Dockerfile.base + Dockerfile.variant with the base published as a long-lived shared image: significant gain, ~50-70% wall-clock reduction with hash-driven cache reuse. Pursued.

The split-base architecture is what the docker-publish-split.yml workflow exercises.

How the split-base pipeline works

                       ┌──────────────────┐
                       │  base-decide     │   compute base-<hash>;
                       │                  │   probe Docker Hub.
                       │  hash inputs:    │
                       │   Dockerfile.base│
                       │   rootfs/        │
                       │   entrypoint*.sh │
                       └────────┬─────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │ need_build = true?        │
                  └─────────────┬─────────────┘
                       yes      │       no
                                ▼
                       ┌──────────────────┐
                       │  build-base      │   multi-arch build,
                       │                  │   push base-<hash>
                       └────────┬─────────┘   to Docker Hub.
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │smoke-base│            │smoke-omos│   ...   │smoke-omos-pi │   amd64 only,
   └────┬─────┘            └────┬─────┘         └──────┬───────┘   parallel.
        │                       │                      │
        ▼                       ▼                      ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │build-    │            │build-    │         │build-        │   multi-arch,
   │variant-  │            │variant-  │   ...   │variant-      │   parallel,
   │base      │            │omos      │         │omos-with-pi  │   tag push.
   └────┬─────┘            └────┬─────┘         └──────┬───────┘
        └───────────────────────┴──────────────────────┘
                                │
                                ▼
                  ┌──────────────────────────┐
                  │  promote-base-latest     │   crane copy
                  │                          │   base-<hash>
                  │                          │   → base-latest
                  └────────┬─────────────────┘
                           │
                           ▼
                  ┌──────────────────────────┐
                  │  update-description      │
                  └──────────────────────────┘

Step 1: base-decide

Compute a SHA-256 hash over the inputs that determine the base image's content:

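# Hash everything that determines the base image's content.
{
  cat Dockerfile.base
  # NUL-delimited, sorted listing keeps the file order stable between runs.
  find rootfs -type f -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12   # keep the first 12 hex characters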

The 12-character truncated hash becomes base-<hash>. Probe Docker Hub for this tag via docker manifest inspect:

  • If it exists → set need_build=false. build-base is skipped entirely.
  • If it doesn't → set need_build=true. build-base runs.

This is the core cache-reuse mechanism. Version-bump-only releases (only Dockerfile.variant or build-args changed) hit the cache. Releases that change anything in the base — apt packages, AWS CLI, Node version, locale list, entrypoint scripts — pay the full base-build cost once.
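
The probe-and-decide step could look roughly like the sketch below. It is not copied from the workflow: the function name decide_need_build and the PROBE_CMD override are hypothetical, the latter added only so the logic can be exercised without Docker. By default it runs docker manifest inspect, as described above.

```shell
# Sketch of the base-decide decision. PROBE_CMD is a hypothetical hook;
# when unset, the probe is `docker manifest inspect <image-ref>`.
decide_need_build() {
  local image_ref="$1"
  if ${PROBE_CMD:-docker manifest inspect} "$image_ref" >/dev/null 2>&1; then
    echo "need_build=false"   # tag already on the registry: skip build-base
  else
    echo "need_build=true"    # tag missing (or probe failed): rebuild
  fi
}

# Example (needs Docker and network access, so left commented out):
# decide_need_build "joakimp/opencode-devbox:base-<hash>"
```

Note that any probe failure, including a network or auth error, is treated as a cache miss in this sketch. That errs on the side of rebuilding rather than publishing variants against a base that may not exist.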

Step 2: build-base (conditional)

Only runs when need_build=true. Multi-arch (amd64 + arm64) build of Dockerfile.base, pushed to joakimp/opencode-devbox:base-<hash>. Registry cache via --cache-from/--cache-to reduces incremental rebuilds when only one or two layers changed.

The base image is not tagged base-latest here — that promotion happens at the very end after all variants succeed (see step 5).

Step 3: smoke-* (×4, parallel)

For each variant: build amd64-only against the base tag, load into local docker, run scripts/smoke-test.sh. Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
| --- | --- | --- | --- |
| base | true | false | false |
| omos | true | true | false |
| with-pi | true | false | true |
| omos-with-pi | true | true | true |

The smoke test runs with --variant <name> to enable variant-specific assertions. Smoke gates the publish: a smoke failure for variant X blocks build-variant-X.
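
The variant-to-build-arg mapping above can be expressed as a small helper. This is an illustrative sketch, not the workflow's code; the function name variant_build_args is hypothetical, while the build-arg names and values come from the table.

```shell
# Print the --build-arg flags for one variant, per the table above.
variant_build_args() {
  local omos=false pi=false
  case "$1" in
    base)         ;;
    omos)         omos=true ;;
    with-pi)      pi=true ;;
    omos-with-pi) omos=true; pi=true ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  printf -- '--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=%s --build-arg INSTALL_PI=%s\n' \
    "$omos" "$pi"
}
```

The output is meant to be spliced into a docker buildx build invocation for Dockerfile.variant. Rejecting unknown names keeps a typo from silently building the wrong variant.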

Step 4: build-variant-* (×4, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of Dockerfile.variant, pushed to Docker Hub with the user-facing release tags:

| Build job | Tags pushed |
| --- | --- |
| build-variant-base | vX.Y.Z, latest |
| build-variant-omos | vX.Y.Z-omos, latest-omos |
| build-variant-with-pi | vX.Y.Z-with-pi, latest-with-pi |
| build-variant-omos-with-pi | vX.Y.Z-omos-with-pi, latest-omos-with-pi |

The latest* aliases are only updated when promote_latest=true (the manual dispatch input) — for test runs, promote_latest=false keeps the production aliases pointing at the previous good release.
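
The tag scheme and the promote_latest switch can be sketched as follows. The function name variant_tags and its argument order are hypothetical; the suffix rule and the alias behavior follow the table and paragraph above.

```shell
# Print the tags one build-variant-* job would push, one per line.
# Usage: variant_tags <release_tag> <variant> [promote_latest]
variant_tags() {
  local release_tag="$1" variant="$2" promote_latest="${3:-false}"
  local suffix=""
  if [ "$variant" != "base" ]; then
    suffix="-$variant"          # base carries no suffix
  fi
  echo "${release_tag}${suffix}"
  if [ "$promote_latest" = "true" ]; then
    echo "latest${suffix}"      # alias only moves on promoted runs
  fi
}
```

For example, variant_tags v1.2.3 omos true prints v1.2.3-omos and latest-omos, while variant_tags v0.0.0-split-test omos false prints only v0.0.0-split-test-omos and leaves the production alias alone.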

Step 5: promote-base-latest

Once all four variants successfully publish, re-tag base-<hash> as base-latest using crane copy. This is a manifest-level re-tag, not a rebuild — it touches only Docker Hub's image index, takes seconds, and is atomic.

The reason this happens after variants succeed (rather than alongside build-base) is so a partial failure leaves base-latest pointing at the previous known-good base. External consumers who pin to base-latest (e.g. the planned pi-devbox repo) never see a broken base.
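
As a rough illustration, the promotion boils down to a single crane copy from the hashed tag to the alias. The wrapper name promote_base_latest and the CRANE override are hypothetical (the override exists only to allow a dry run); the ordering guarantee itself comes from the workflow's job dependencies, not from this snippet.

```shell
# Re-tag base-<hash> as base-latest at the manifest level.
# CRANE is a hypothetical override; set CRANE=echo for a dry run.
promote_base_latest() {
  local repo="$1" hash="$2"
  ${CRANE:-crane} copy "${repo}:base-${hash}" "${repo}:base-latest"
}

# Dry run, prints the crane arguments instead of calling the registry:
# CRANE=echo promote_base_latest joakimp/opencode-devbox 0123456789ab
```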

Step 6: update-description

Push the generated DOCKER_HUB.md to the Hub repo's full_description field via the Hub REST API. Same step as the production pipeline.

NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

This is intentional — it makes pi install npm:<pkg> and npm install -g land on the devbox-pi-config named volume at runtime, so user-installed packages survive container recreate AND image rebuild.

But the variant build inherits this prefix at build time. If left as-is, npm install -g opencode-ai@$VERSION in Dockerfile.variant would install opencode into /home/developer/.pi/npm-global/..., which is then shadowed by the volume mount at runtime → opencode disappears from PATH on first start.

Fix: each npm install -g in Dockerfile.variant overrides the prefix per-RUN:

RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}

Baked binaries land in /usr/bin/... (the system prefix) and so survive the volume mount. Runtime-installed user packages still land in ~/.pi/npm-global/.... Both locations are visible on PATH.
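
To see why the override matters, the sketch below checks whether an install location would sit under the volume mount. It assumes the devbox-pi-config volume is mounted at /home/developer/.pi (the exact mount point is not stated here, so treat it as a guess) and relies on npm linking global executables into the bin directory under its prefix on Linux. The helper names are hypothetical.

```shell
# Directory where npm links global executables for a given prefix (Linux).
npm_global_bin() {
  echo "${1%/}/bin"
}

# Succeeds if path $1 lies under mount point $2, i.e. it would be hidden
# by whatever is mounted there at runtime.
is_under_mount() {
  case "$1/" in
    "$2"/*) return 0 ;;
    *)      return 1 ;;
  esac
}

VOLUME_MOUNT=/home/developer/.pi   # assumed mount point of devbox-pi-config

for prefix in /home/developer/.pi/npm-global /usr; do
  bin_dir=$(npm_global_bin "$prefix")
  if is_under_mount "$bin_dir" "$VOLUME_MOUNT"; then
    echo "$bin_dir: shadowed by the volume at runtime"
  else
    echo "$bin_dir: baked into the image, stays visible"
  fi
done
```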

Cache strategy

Two registry caches are configured:

cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max

mode=max exports cache for all layers, not just the final image's layers (the default, mode=min, exports only the latter). This is important for multi-arch builds, where cross-arch layer reuse matters more.
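
For anyone reproducing a build by hand, the same settings map onto docker buildx build flags. The helper below is a sketch with a hypothetical name, registry_cache_flags; it only formats the flags and does not run a build.

```shell
# Print --cache-from/--cache-to flags for one registry cache ref,
# mirroring the workflow settings above (mode=max on export).
registry_cache_flags() {
  printf -- '--cache-from type=registry,ref=%s --cache-to type=registry,ref=%s,mode=max\n' \
    "$1" "$1"
}

# Example:
# registry_cache_flags joakimp/opencode-devbox:base-buildcache
```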

Wall-clock estimates

| Scenario | Production pipeline | Split-base pipeline |
| --- | --- | --- |
| Version-bump-only release (only opencode/pi/omos version changed) | ~165-180 min | ~30-40 min (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165-180 min | ~70-90 min (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases (which are infrequent — a few times a year for Debian / Node major version bumps). Most releases are version-bumps and ride the cache.
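
As a quick sanity check on the table, the relative savings can be worked out with integer arithmetic. The helper name pct_reduction is hypothetical; the inputs are the estimate bounds from the table, paired so that each scenario shows its least and most favorable case.

```shell
# Percentage reduction from an old to a new duration (integer division).
pct_reduction() {
  echo $(( 100 * ($1 - $2) / $1 ))
}

# Version-bump-only release
pct_reduction 165 40   # prints 75 (fast old run vs slow new run)
pct_reduction 180 30   # prints 83 (slow old run vs fast new run)

# Base-touching release
pct_reduction 165 90   # prints 45
pct_reduction 180 70   # prints 61
```

These are estimates derived from estimates, so they indicate the expected order of magnitude rather than measured results.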

Validate workflow

validate.yml is the lightweight gate that runs on every push to main and on PRs. It:

  1. Runs scripts/generate-dockerhub-md.py --check to enforce DOCKER_HUB.md is in sync with HUB_TEMPLATE.
  2. Builds each of the four variants amd64-only (no multi-arch, no push) and runs scripts/smoke-test.sh.

This catches regressions before they reach a tag push. Wall clock ~30 min.

Runner expectations

  • Image: catthehacker/ubuntu:act-latest. Each job runs inside a fresh container of this image. Don't assume any pre-installed toolchains beyond what catthehacker ships.
  • Disk pressure: the runner host has ~40 GB of usable overlay space, often 70%+ used at job start. Every job that does load: true (smoke) starts with a Reclaim runner disk step that strips catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, Boost, Chromium, PowerShell) and prunes stale docker state. Don't remove these steps without testing on a fresh runner.
  • Concurrency: 2 runners. Jobs in the same workflow run can fan out to both; jobs in different workflow runs are serialized by Gitea's queue. The concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false } setting keeps tag pushes from racing each other but allows per-PR/per-branch parallelism.
  • Workflow visibility in UI: Gitea Actions only surfaces workflows from the default branch in the web UI's workflow list, even for workflow_dispatch triggers. Workflows on feature branches are invisible until merged to main.
  • Artifact quirk: actions/{upload,download}-artifact@v4+ does not work on Gitea (it depends on a GitHub-only Artifact API). Stick to @v3 if matrix-fanout-with-artifacts is ever needed. We avoided this by using docker/build-push-action@v7 with comma-separated platforms: linux/amd64,linux/arm64, which natively does a multi-arch push in a single job, with no artifact dance.

Migration plan: split-base → production

  1. Validate the split-base dispatch. Trigger docker-publish-split.yml manually with release_tag=v0.0.0-split-test and promote_latest=false. Confirm all jobs go green, image sizes match the production baseline within ~10%, and no unexpected layer rebuilds appear in build-variant-* logs after the FROM line.
  2. Run a second dispatch to confirm cache-hit behavior: base-decide should set need_build=false, build-base should be skipped entirely, total wall clock should drop to ~25-40 min.
  3. Cut over. In a single commit:
    • Edit docker-publish-split.yml: change on: workflow_dispatch: to on: push: tags: v* and wire $GITHUB_REF into the release_tag input, set promote_latest=true for production runs.
    • Delete docker-publish.yml.
    • Delete the original Dockerfile (keep Dockerfile.base + Dockerfile.variant).
    • Update CHANGELOG.md: promote the "Build pipeline" Unreleased entry.
  4. Tag a release. First production release on the new pipeline. Watch it like a hawk for the first run.
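
For the cutover in step 3, deriving release_tag from the pushed ref could look like the sketch below. It assumes GITHUB_REF has the form refs/tags/<tag> on tag pushes, as on GitHub Actions; the function name release_tag_from_ref is hypothetical, and how the value is fed into the rest of the workflow is left open.

```shell
# Derive the release tag from a ref such as refs/tags/v1.2.3.
# Fails for anything that is not a v* tag ref.
release_tag_from_ref() {
  case "$1" in
    refs/tags/v*) echo "${1#refs/tags/}" ;;
    *) echo "not a v* tag ref: $1" >&2; return 1 ;;
  esac
}

# In a workflow step (commented out; GITHUB_REF is set by the runner):
# release_tag=$(release_tag_from_ref "$GITHUB_REF")
```

Failing loudly on non-tag refs guards against an accidental run from a branch publishing under a malformed tag.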

Related docs

  • AGENTS.md: domain facts, release-day checklist, documentation coupling rules. Read first when modifying CI behavior.
  • CHANGELOG.md — the build pipeline rewrite is recorded under Unreleased until the cutover lands.
  • Dockerfile, Dockerfile.base, Dockerfile.variant — production single-Dockerfile build and the split-base counterparts. Comments at the top of each explain its role.
  • scripts/smoke-test.sh — invoked by all three workflows; this is the single source of truth for "what does a built image have to satisfy".
  • scripts/generate-dockerhub-md.py — generates DOCKER_HUB.md from HUB_TEMPLATE. --check enforces sync in validate.yml.