joakimp/opencode-devbox

joakimp 6fde27c212

Document the build pipeline architecture in .gitea/README.md

The split-base build architecture, the NPM_CONFIG_PREFIX gotcha, the
hash-driven base cache reuse mechanism, and the cutover plan from
docker-publish.yml to docker-publish-split.yml were previously
scattered across:
  - inline Dockerfile.base / Dockerfile.variant comments
  - CHANGELOG Unreleased entries
  - AGENTS.md mentions
  - docker-publish-split.yml header comment
  - my own session notes

Consolidate into .gitea/README.md as the canonical architectural doc.
Gitea (like GitHub) auto-renders this when navigating to .gitea/ in
the web UI, so anyone investigating 'why is CI shaped this way?'
finds it on the first click. Cross-referenced from AGENTS.md as the
first thing to read when touching CI.

Covers:
  - The two release pipelines and why both exist
  - Why split-base: cross-variant cache misses on layer-hash-divergence
  - The 6 phases of the split-base pipeline with an ASCII diagram
  - base-decide hash inputs and Docker Hub probe logic
  - NPM_CONFIG_PREFIX variant-override pattern (the volume-shadow trap)
  - Registry cache strategy (mode=max for cross-arch reuse)
  - Wall-clock estimates: version-bump vs base-touching releases
  - Validate workflow role
  - Runner expectations: catthehacker image, disk reclaim, concurrency,
    Gitea Actions @v4 artifact incompatibility
  - 4-step migration plan from docker-publish.yml to .split.yml
  - Cross-refs to related docs

Does not duplicate AGENTS.md content; links to it for domain facts and
release-day checklist.

2026-05-09 19:28:03 +02:00

CI / Build Pipeline

This directory contains the Gitea Actions workflows and the supporting documentation for opencode-devbox's CI. If you're investigating why the build pipeline is shaped the way it is, you're in the right place.

Workflows in this directory

| File | Trigger | Role |
| --- | --- | --- |
| `workflows/docker-publish.yml` | `push: tags: v*` | Production release pipeline. Multi-arch build of all four variants (`base`, `omos`, `with-pi`, `omos-with-pi`), publish to Docker Hub, update Hub description. ~165–180 min wall clock. |
| `workflows/docker-publish-split.yml` | `workflow_dispatch` (manual) | Experimental split-base pipeline. Two-phase build: shared `base-<hash>` published once, then four thin variant deltas. Estimated ~30–40 min on cache hit, ~70–90 min when base needs rebuilding. Not yet validated end-to-end; once 1–2 dispatch test runs prove it, this will take over `on: push: tags: v*` and `docker-publish.yml` will be retired. |
| `workflows/validate.yml` | `push: branches: main` + PR | Lightweight gate. amd64-only smoke test of all four variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. |

Why two release pipelines exist

opencode-devbox publishes four image variants (base, omos, with-pi, omos-with-pi) × two architectures (amd64, arm64) = eight platform-specific images per release. CI currently runs on two self-hosted Gitea Actions runners. arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original Dockerfile was a single multi-stage build with INSTALL_* build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated RUN produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.
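The cascade described above can be illustrated with a toy model. This is a sketch under a simplifying assumption, namely that a layer's cache key is the hash of its parent's key plus the instruction text; it is not BuildKit's actual key derivation, and the instruction strings are hypothetical.

```shell
#!/bin/sh
# Toy model (NOT BuildKit's real algorithm): each layer key is
# sha256(parent_key + instruction). Instruction strings are hypothetical.

chain_keys() {
  # Reads one instruction per line on stdin; prints one 12-char key per layer.
  parent=""
  while IFS= read -r instruction; do
    parent=$(printf '%s\n%s' "$parent" "$instruction" | sha256sum | cut -c1-12)
    printf '%s\n' "$parent"
  done
}

variant_steps() {
  # $1 = value of the INSTALL_OMOS build-arg for this variant.
  printf '%s\n' \
    "FROM debian" \
    "RUN apt-get install ..." \
    "RUN install-omos INSTALL_OMOS=$1" \
    "RUN install-dev-tools"
}

tmp=$(mktemp -d)
variant_steps false | chain_keys > "$tmp/base"
variant_steps true  | chain_keys > "$tmp/omos"

# Side by side: the first two keys match, the gated step differs, and the
# identical final step differs too because its parent key differs.
paste "$tmp/base" "$tmp/omos"
rm -rf "$tmp"
```

In this model the last instruction is byte-for-byte identical in both variants, yet its key still diverges. That is the reason shared work placed after a variant-gated step gets no cross-variant cache reuse.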

Two improvements were considered:

1. Reorder the original Dockerfile so all variant-gated RUNs land at the bottom. Modest gain, ~10–20% wall-clock reduction. Not pursued.
2. Split into Dockerfile.base + Dockerfile.variant with the base published as a long-lived shared image. Significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. Pursued.

The split-base architecture is what the docker-publish-split.yml workflow exercises.

How the split-base pipeline works

                       ┌──────────────────┐
                       │  base-decide     │   compute base-<hash>;
                       │                  │   probe Docker Hub.
                       │  hash inputs:    │
                       │   Dockerfile.base│
                       │   rootfs/        │
                       │   entrypoint*.sh │
                       └────────┬─────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │ need_build = true?        │
                  └─────────────┬─────────────┘
                       yes      │       no
                                ▼
                       ┌──────────────────┐
                       │  build-base      │   multi-arch build,
                       │                  │   push base-<hash>
                       └────────┬─────────┘   to Docker Hub.
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │smoke-base│            │smoke-omos│   ...   │smoke-omos-pi │   amd64 only,
   └────┬─────┘            └────┬─────┘         └──────┬───────┘   parallel.
        │                       │                      │
        ▼                       ▼                      ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │build-    │            │build-    │         │build-        │   multi-arch,
   │variant-  │            │variant-  │   ...   │variant-      │   parallel,
   │base      │            │omos      │         │omos-with-pi  │   tag push.
   └────┬─────┘            └────┬─────┘         └──────┬───────┘
        └───────────────────────┴──────────────────────┘
                                │
                                ▼
                  ┌──────────────────────────┐
                  │  promote-base-latest     │   crane copy
                  │                          │   base-<hash>
                  │                          │   → base-latest
                  └────────┬─────────────────┘
                           │
                           ▼
                  ┌──────────────────────────┐
                  │  update-description      │
                  └──────────────────────────┘

Step 1: `base-decide`

Compute a SHA-256 hash over the inputs that determine the base image's content:

{
  cat Dockerfile.base
  find rootfs -type f -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12

The 12-character truncated hash becomes base-<hash>. Probe Docker Hub for this tag via docker manifest inspect:

- If it exists → set need_build=false. build-base is skipped entirely.
- If it doesn't → set need_build=true. build-base runs.

This is the core cache-reuse mechanism. Version-bump-only releases (only Dockerfile.variant or build-args changed) hit the cache. Releases that change anything in the base — apt packages, AWS CLI, Node version, locale list, entrypoint scripts — pay the full base-build cost once.
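The decision logic can be sketched as a few shell functions. This is an illustration, not the workflow's actual step: the `PROBE` override is a hypothetical hook added here so the logic can be exercised without Docker, and the function names are made up for the example.

```shell
#!/bin/sh
# Sketch of base-decide. PROBE is a hypothetical override hook; the default
# probe is `docker manifest inspect`, as described above.

IMAGE="${IMAGE:-joakimp/opencode-devbox}"

base_hash() {
  # Same inputs as listed above; run from the repository root.
  {
    cat Dockerfile.base
    find rootfs -type f -print0 | sort -z | xargs -0 cat
    cat entrypoint.sh entrypoint-user.sh
  } | sha256sum | cut -c1-12
}

tag_exists() {
  # Exit status 0 when the tag's manifest can be fetched from the registry.
  ${PROBE:-docker manifest inspect} "$1" >/dev/null 2>&1
}

decide() {
  # $1 = base hash; prints the need_build value.
  if tag_exists "$IMAGE:base-$1"; then
    echo "need_build=false"
  else
    echo "need_build=true"
  fi
}

# In a workflow step the result would typically be appended to the step
# outputs, for example:
#   decide "$(base_hash)" >> "$GITHUB_OUTPUT"
```

Two properties of this sketch are worth knowing. Any probe failure (missing tag, network error, auth error) yields need_build=true, so the fallback is a rebuild rather than a skipped build. And the hash covers file contents only, so renaming a file under rootfs/ without changing the sort order would leave the hash unchanged.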

Step 2: `build-base` (conditional)

Only runs when need_build=true. Multi-arch (amd64 + arm64) build of Dockerfile.base, pushed to joakimp/opencode-devbox:base-<hash>. Registry cache via --cache-from/--cache-to reduces incremental rebuilds when only one or two layers changed.

The base image is not tagged base-latest here — that promotion happens at the very end after all variants succeed (see step 5).

Step 3: `smoke-*` (×4, parallel)

For each variant: build amd64-only against the base tag, load into local docker, run scripts/smoke-test.sh. Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
| --- | --- | --- | --- |
| `base` | true | false | false |
| `omos` | true | true | false |
| `with-pi` | true | false | true |
| `omos-with-pi` | true | true | true |

The smoke test runs with --variant <name> to enable variant-specific assertions. Smoke gates the publish: a smoke failure for variant X blocks build-variant-X.
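The variant-to-build-arg mapping above can be expressed as a small lookup function. This is a sketch for illustration: `variant_build_args` is a hypothetical helper, not something the workflows are known to define.

```shell
#!/bin/sh
# Maps a variant name to its build-args, mirroring the table above.
# variant_build_args is a hypothetical helper name.
variant_build_args() {
  case "$1" in
    base)         omos=false; pi=false ;;
    omos)         omos=true;  pi=false ;;
    with-pi)      omos=false; pi=true  ;;
    omos-with-pi) omos=true;  pi=true  ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=$omos --build-arg INSTALL_PI=$pi"
}

# Possible use in an amd64-only smoke build (the image tag is an assumption):
#   docker buildx build --platform linux/amd64 --load \
#     -f Dockerfile.variant $(variant_build_args omos) -t devbox-smoke:omos .
```

Rejecting unknown variant names with a non-zero status keeps a typo from silently building the wrong image.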

Step 4: `build-variant-*` (×4, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of Dockerfile.variant, pushed to Docker Hub with the user-facing release tags:

| Build job | Tags pushed |
| --- | --- |
| `build-variant-base` | `vX.Y.Z`, `latest` |
| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` |
| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` |
| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` |

The latest* aliases are only updated when promote_latest=true (the manual dispatch input) — for test runs, promote_latest=false keeps the production aliases pointing at the previous good release.
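The tag scheme and the promote_latest switch can be sketched as one function. `release_tags` is a hypothetical name used only for this example; the real workflow may compute its tag list differently.

```shell
#!/bin/sh
# Prints the tags a build-variant-* job would push, one per line.
# release_tags is a hypothetical helper name.
release_tags() {
  # $1 = variant, $2 = version (e.g. v1.2.3), $3 = promote_latest (true|false)
  case "$1" in
    base) suffix="" ;;
    omos|with-pi|omos-with-pi) suffix="-$1" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  echo "$2$suffix"
  # The latest* alias is only emitted for promoted (production) runs.
  if [ "$3" = "true" ]; then
    echo "latest$suffix"
  fi
}

release_tags omos v1.2.3 true
# prints:
#   v1.2.3-omos
#   latest-omos
```

With promote_latest=false the same call prints only the versioned tag, which is what keeps test dispatches away from the production aliases.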

Step 5: `promote-base-latest`

Once all four variants successfully publish, re-tag base-<hash> as base-latest using crane copy. This is a manifest-level re-tag, not a rebuild — it touches only Docker Hub's image index, takes seconds, and is atomic.

The reason this happens after variants succeed (rather than alongside build-base) is so a partial failure leaves base-latest pointing at the previous known-good base. External consumers who pin to base-latest (e.g. the planned pi-devbox repo) never see a broken base.
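A minimal sketch of the promotion gate follows. The `DRY_RUN` switch and both function names are assumptions made for the example; `crane copy SRC DST` is the go-containerregistry command named above, and the `success` string matches what Actions reports in `needs.<job>.result`.

```shell
#!/bin/sh
# Sketch of promote-base-latest. DRY_RUN and the function names are
# hypothetical; crane copy is the real re-tag command.

all_succeeded() {
  # Arguments: one result string per variant job.
  for result in "$@"; do
    [ "$result" = "success" ] || return 1
  done
}

promote_base() {
  # $1 = 12-char base hash.
  src="joakimp/opencode-devbox:base-$1"
  dst="joakimp/opencode-devbox:base-latest"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "crane copy $src $dst"
  else
    crane copy "$src" "$dst"
  fi
}

# Promote only when every variant job reported success, for example:
#   all_succeeded "$R_BASE" "$R_OMOS" "$R_PI" "$R_OMOS_PI" && promote_base "$HASH"
```

Because the gate short-circuits on the first non-success result, a single failed variant leaves base-latest untouched.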

Step 6: `update-description`

Push the generated DOCKER_HUB.md to the Hub repo's full_description field via the Hub REST API. Same step as the production pipeline.

NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

This is intentional: it makes pi install npm:<pkg> and npm install -g land on the devbox-pi-config named volume at runtime, so user-installed packages survive both container recreation and image rebuilds.

But the variant build inherits this prefix at build time. If left as-is, npm install -g opencode-ai@$VERSION in Dockerfile.variant would install opencode into /home/developer/.pi/npm-global/..., which is then shadowed by the volume mount at runtime → opencode disappears from PATH on first start.

Fix: each npm install -g in Dockerfile.variant overrides the prefix per-RUN:

RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}

Baked binaries land in /usr/bin/... (system prefix) and survive the volume mount. Runtime-installed user packages still land in ~/.pi/npm-global/.... Both are visible on PATH.
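The per-RUN override relies on ordinary shell semantics: a `VAR=value command` prefix changes the environment for that one command only. The sketch below demonstrates this without npm; the `sh -c` child is a stand-in for `npm install -g`, which reads the same environment variable.

```shell
#!/bin/sh
# Simulates the image-level ENV set by the base image.
export NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

# Stand-in for `npm install -g`: reports the prefix the child process sees.
child='echo "$NPM_CONFIG_PREFIX"'

inherited=$(sh -c "$child")
overridden=$(NPM_CONFIG_PREFIX=/usr sh -c "$child")
afterwards=$(sh -c "$child")

echo "inherited:  $inherited"    # inherited:  /home/developer/.pi/npm-global
echo "overridden: $overridden"   # overridden: /usr
echo "afterwards: $afterwards"   # afterwards: /home/developer/.pi/npm-global
```

The third line is the important one: the override does not leak, so the image-level ENV is still in effect for everything that runs later, including the container at runtime.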

Cache strategy

Two registry caches are configured:

cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max

mode=max exports cache for every layer of every build stage, not only the layers that end up in the final image (which is all the default mode=min exports). This matters most for multi-arch builds, where cross-arch layer reuse is worth more.
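For local experiments outside the workflow, the same cache settings map onto `docker buildx build` flags. The helper below is a hypothetical convenience that only assembles the flag string; the refs are the ones listed above.

```shell
#!/bin/sh
# Builds the --cache-from/--cache-to flags for one cache tag.
# cache_flags is a hypothetical helper name.
cache_flags() {
  # $1 = cache tag, e.g. base-buildcache
  ref="joakimp/opencode-devbox:$1"
  printf '%s\n' "--cache-from type=registry,ref=$ref --cache-to type=registry,ref=$ref,mode=max"
}

# Possible use (pushing requires registry credentials):
#   docker buildx build --platform linux/amd64,linux/arm64 \
#     $(cache_flags base-buildcache) -f Dockerfile.base --push \
#     -t "joakimp/opencode-devbox:base-$HASH" .
```

Reading and writing the same ref keeps the cache self-refreshing: each build imports the previous export and replaces it with its own.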

Wall-clock estimates

| Scenario | Production pipeline | Split-base pipeline |
| --- | --- | --- |
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | ~30–40 min (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | ~70–90 min (base rebuilds) |

The split-base pipeline pays its dues on base-touching releases (which are infrequent — a few times a year for Debian / Node major version bumps). Most releases are version-bumps and ride the cache.
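The estimates in the table translate into percentage savings with simple integer arithmetic. The sketch below pairs the extremes of each range (shortest old run against longest new run for the pessimistic bound, and the reverse for the optimistic one); the inputs are the estimates quoted above, not measurements.

```shell
#!/bin/sh
# Percent of wall-clock time saved, rounded down.
reduction_pct() {
  # $1 = old minutes, $2 = new minutes
  echo $(( ($1 - $2) * 100 / $1 ))
}

echo "version-bump:  $(reduction_pct 165 40)% to $(reduction_pct 180 30)%"
# prints: version-bump:  75% to 83%
echo "base-touching: $(reduction_pct 165 90)% to $(reduction_pct 180 70)%"
# prints: base-touching: 45% to 61%
```

The ~50–70% figure quoted earlier for the split approach sits between the two scenarios, which is plausible for a mix dominated by version-bump releases with occasional base rebuilds.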

Validate workflow

validate.yml is the lightweight gate that runs on every push to main and on PRs. It:

- Runs scripts/generate-dockerhub-md.py --check to enforce DOCKER_HUB.md is in sync with HUB_TEMPLATE.
- Builds each of the four variants amd64-only (no multi-arch, no push) and runs scripts/smoke-test.sh.

This catches regressions before they reach a tag push. Wall clock ~30 min.

Runner expectations

- Image: catthehacker/ubuntu:act-latest. Each job runs inside a fresh container of this image. Don't assume any pre-installed toolchains beyond what catthehacker ships.
- Disk pressure: the runner host has ~40 GB of usable overlay space, often 70%+ used at job start. Every job that does load: true (smoke) starts with a Reclaim runner disk step that strips catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM, Boost, Chromium, PowerShell) and prunes stale docker state. Don't remove these steps without testing on a fresh runner.
- Concurrency: 2 runners. Jobs in the same workflow run can fan out to both; jobs in different workflow runs are serialized by Gitea's queue. The concurrency: { group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: false } setting keeps tag pushes from racing each other but allows per-PR/per-branch parallelism.
- Workflow visibility in UI: Gitea Actions only surfaces workflows from the default branch in the web UI's workflow list, even for workflow_dispatch triggers. Workflows on feature branches are invisible until merged to main.
- Artifact actions quirk: actions/{upload,download}-artifact@v4+ does not work on Gitea (depends on a GitHub-only Artifact API). Stick to @v3 if matrix-fanout-with-artifacts is ever needed. We avoided this by using docker/build-push-action@v7 with comma-separated platforms: linux/amd64,linux/arm64, which does the multi-arch push natively in a single job, with no artifact dance.

Migration plan: split-base → production

1. Validate the split-base dispatch. Trigger docker-publish-split.yml manually with release_tag=v0.0.0-split-test and promote_latest=false. Confirm all jobs go green, image sizes match the production baseline within ~10%, and no unexpected layer rebuilds appear in build-variant-* logs after the FROM line.
2. Run a second dispatch to confirm cache-hit behavior: base-decide should set need_build=false, build-base should be skipped entirely, and total wall clock should drop to ~30–40 min.
3. Cut over. In a single commit:
   - Edit docker-publish-split.yml: change on: workflow_dispatch: to on: push: tags: v*, wire $GITHUB_REF into the release_tag input, and set promote_latest=true for production runs.
   - Delete docker-publish.yml.
   - Delete the original Dockerfile (keep Dockerfile.base + Dockerfile.variant).
   - Update CHANGELOG.md: promote the "Build pipeline" Unreleased entry.
4. Tag a release. This is the first production release on the new pipeline, so watch the first run like a hawk.
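For the cutover step that wires $GITHUB_REF into release_tag, note that the variable holds the full ref (for a tag push, `refs/tags/<tag>`), so the prefix has to be stripped somewhere. A minimal sketch using shell parameter expansion, with `release_tag_from_ref` as a hypothetical helper name:

```shell
#!/bin/sh
# Derives the release tag from a full Git ref.
# release_tag_from_ref is a hypothetical helper name.
release_tag_from_ref() {
  # Prints the tag for refs/tags/v* refs; fails for anything else.
  case "$1" in
    refs/tags/v*) echo "${1#refs/tags/}" ;;
    *) return 1 ;;
  esac
}

release_tag_from_ref "refs/tags/v1.2.3"
# prints: v1.2.3
```

Failing on non-tag refs guards against a branch push accidentally producing a release tag. Where the runner provides `GITHUB_REF_NAME`, that variable already carries the short name and can be used instead.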

Related docs

- AGENTS.md: domain facts, release-day checklist, documentation coupling rules. Read first when modifying CI behavior.
- CHANGELOG.md: the build pipeline rewrite is recorded under Unreleased until the cutover lands.
- Dockerfile, Dockerfile.base, Dockerfile.variant: production single-Dockerfile build and the split-base counterparts. Comments at the top of each explain its role.
- scripts/smoke-test.sh: invoked by all three workflows; this is the single source of truth for "what does a built image have to satisfy".
- scripts/generate-dockerhub-md.py: generates DOCKER_HUB.md from HUB_TEMPLATE. --check enforces sync in validate.yml.
