opencode-devbox/.gitea/README.md

# CI / Build Pipeline

This directory contains the Gitea Actions workflows and the supporting
documentation for opencode-devbox's CI. If you're investigating *why*
the build pipeline is shaped the way it is, you're in the right place.

## Workflows in this directory

| File | Trigger | Role |
|---|---|---|
| [`workflows/docker-publish-split.yml`](workflows/docker-publish-split.yml) | `push: tags: v*` | **Production release pipeline.** Two-phase split-base build: shared `base-<hash>` published once (skipped on cache hit), then four parallel variant deltas. ~30–90 min wall clock depending on runner count and whether the base needs rebuilding. |
| [`workflows/validate.yml`](workflows/validate.yml) | `push: branches: main` + PR | **Lightweight gate.** amd64-only smoke test of all four variants + `DOCKER_HUB.md` sync check. ~30 min. Fires on every push to `main`. |

## Why the split-base pipeline exists

opencode-devbox publishes **four image variants** (`base`, `omos`, `with-pi`, `omos-with-pi`) × **two architectures** (amd64, arm64) = **eight platform-specific images per release**. CI currently runs on two self-hosted Gitea Actions runners. The arm64 builds are emulated under QEMU, which is the dominant cost (~3–5x slower than native).

The four variants share ~95% of their layers (Debian + apt + Node + AWS CLI + mempalace + dev tools + entrypoints). The original `Dockerfile` was a single multi-stage build with `INSTALL_*` build-args gating variant-specific RUNs. BuildKit's per-layer cache key is content-addressed, but as soon as a build-arg-gated `RUN` produces a different layer hash for variant A vs variant B, every subsequent layer also has a different parent → identical commands re-execute per variant. Result: minimal cross-variant cache reuse on a fresh build.
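The chaining effect described above can be illustrated with a toy model. This is a deliberate simplification and not BuildKit's actual algorithm: it assumes a layer's key is just a hash of its parent's key plus the instruction text. The `layer_keys` helper and the instruction strings are made up for the illustration.

```sh
#!/bin/sh
# Toy model, NOT BuildKit's real algorithm: a layer's key here is
# sha256(parent key + instruction text), enough to show the chaining effect.
layer_keys() {
  parent=""
  for instr in "$@"; do
    parent=$(printf '%s\n%s' "$parent" "$instr" | sha256sum | cut -c1-8)
    printf '%s  %s\n' "$parent" "$instr"
  done
}

# Hypothetical instruction lists; only the gated RUN differs between variants.
keys_a=$(layer_keys "FROM debian" "RUN install-omos false" "RUN install-awscli")
keys_b=$(layer_keys "FROM debian" "RUN install-omos true"  "RUN install-awscli")

echo "variant A:"; echo "$keys_a"
echo "variant B:"; echo "$keys_b"
```

In the output, the first key matches across both variants, while the third differs even though the instruction text is identical, because its parent differs. That is why identical commands after a gated `RUN` cannot share cache entries.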

Two improvements were considered:

1. **Reorder the original Dockerfile** so all variant-gated RUNs land at the bottom — modest gain, ~10–20% wall-clock reduction. *Not pursued.*
2. **Split into `Dockerfile.base` + `Dockerfile.variant`** with the base published as a long-lived shared image — significant gain, ~50–70% wall-clock reduction with hash-driven cache reuse. *Pursued.*

The split-base architecture is what the `docker-publish-split.yml` workflow exercises.

## How the split-base pipeline works

```
                       ┌──────────────────┐
                       │  base-decide     │   compute base-<hash>;
                       │                  │   probe Docker Hub.
                       │  hash inputs:    │
                       │   Dockerfile.base│
                       │   rootfs/        │
                       │   entrypoint*.sh │
                       └────────┬─────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │ need_build = true?        │
                  └─────────────┬─────────────┘
                       yes      │       no
                                ▼
                       ┌──────────────────┐
                       │  build-base      │   multi-arch build,
                       │                  │   push base-<hash>
                       └────────┬─────────┘   to Docker Hub.
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │smoke-base│            │smoke-omos│   ...   │smoke-omos-pi │   amd64 only,
   └────┬─────┘            └────┬─────┘         └──────┬───────┘   parallel.
        │                       │                      │
        ▼                       ▼                      ▼
   ┌──────────┐            ┌──────────┐         ┌──────────────┐
   │build-    │            │build-    │         │build-        │   multi-arch,
   │variant-  │            │variant-  │   ...   │variant-      │   parallel,
   │base      │            │omos      │         │omos-with-pi  │   tag push.
   └────┬─────┘            └────┬─────┘         └──────┬───────┘
        └───────────────────────┴──────────────────────┘
                                │
                                ▼
                  ┌──────────────────────────┐
                  │  promote-base-latest     │   crane copy
                  │                          │   base-<hash>
                  │                          │   → base-latest
                  └────────┬─────────────────┘
                           │
                           ▼
                  ┌──────────────────────────┐
                  │  update-description      │
                  └──────────────────────────┘
```

### Step 1: `base-decide`

Compute a SHA-256 hash over the inputs that determine the base image's
content:

```sh
{
  cat Dockerfile.base
  find rootfs -type f \
    ! -path '*/__pycache__/*' \
    ! -name '*.pyc' \
    ! -name '.DS_Store' \
    ! -name '._*' \
    -print0 | sort -z | xargs -0 cat
  cat entrypoint.sh entrypoint-user.sh
} | sha256sum | cut -c1-12
```

The junk filters keep a local recompute reproducible against CI's clean
checkout: `__pycache__/*.pyc` and macOS metadata files (`.DS_Store`,
AppleDouble `._*` files) are gitignored but still walked by `find -type f`.

The 12-character truncated hash becomes `base-<hash>`. Probe Docker Hub
for this tag via `docker manifest inspect`:

- If it exists → set `need_build=false`. `build-base` is skipped entirely.
- If it doesn't → set `need_build=true`. `build-base` runs.

This is the core cache-reuse mechanism. Version-bump-only releases
(only `Dockerfile.variant` or build-args changed) hit the cache. Releases
that change anything in the base — apt packages, AWS CLI, Node version,
locale list, entrypoint scripts — pay the full base-build cost once.
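The probe-and-decide logic above can be sketched as follows. The function names `probe_tag` and `decide_need_build` are illustrative assumptions, and the real workflow presumably writes the result to the job's outputs rather than to stdout.

```sh
#!/bin/sh
# Sketch only; helper names are hypothetical, not taken from the workflow.
IMAGE="joakimp/opencode-devbox"

probe_tag() {
  # docker manifest inspect exits non-zero when the tag cannot be resolved.
  docker manifest inspect "$1" >/dev/null 2>&1
}

decide_need_build() {
  # $1 = 12-character base hash
  if probe_tag "${IMAGE}:base-$1"; then
    echo "need_build=false"
  else
    echo "need_build=true"
  fi
}

# Usage (needs docker and network access):
#   decide_need_build "$(compute-the-hash-as-shown-above)"
```

Note that in this sketch any probe failure (network error, auth problem) is indistinguishable from a missing tag. That errs on the side of an unnecessary rebuild rather than a skipped one.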

### Step 2: `build-base` (conditional)

Only runs when `need_build=true`. Multi-arch (amd64 + arm64) build of
`Dockerfile.base`, pushed to `joakimp/opencode-devbox:base-<hash>`.
Registry cache via `--cache-from/--cache-to` reduces incremental rebuilds
when only one or two layers changed.

The base image is **not** tagged `base-latest` here — that promotion
happens at the very end after all variants succeed (see step 5).

### Step 3: `smoke-*` (×4, parallel)

For each variant: build amd64-only against the base tag, load into
local docker, run [`scripts/smoke-test.sh`](../scripts/smoke-test.sh).
Variant build-args:

| variant | INSTALL_OPENCODE | INSTALL_OMOS | INSTALL_PI |
|---|---|---|---|
| `base` | true | false | false |
| `omos` | true | true | false |
| `with-pi` | true | false | true |
| `omos-with-pi` | true | true | true |

The smoke test is invoked with `--variant <name>` to enable variant-specific
assertions. Each smoke job gates its publish job: a smoke failure for variant X
blocks `build-variant-X`.
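For local experiments it can help to have the build-arg table above in executable form. The helper below is a hypothetical convenience, not something the workflow uses; it only encodes the four rows of the table.

```sh
#!/bin/sh
# variant_build_args is a hypothetical helper that mirrors the table above.
variant_build_args() {
  case "$1" in
    base)         omos=false pi=false ;;
    omos)         omos=true  pi=false ;;
    with-pi)      omos=false pi=true  ;;
    omos-with-pi) omos=true  pi=true  ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  # INSTALL_OPENCODE is true for every variant in the table.
  echo "--build-arg INSTALL_OPENCODE=true --build-arg INSTALL_OMOS=$omos --build-arg INSTALL_PI=$pi"
}

for v in base omos with-pi omos-with-pi; do
  printf '%-13s %s\n' "$v" "$(variant_build_args "$v")"
done
```

An unknown variant name returns a non-zero status, so a typo fails fast instead of silently building the wrong image.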

### Step 4: `build-variant-*` (×4, parallel)

For each variant that passed smoke: multi-arch (amd64 + arm64) build of
`Dockerfile.variant`, pushed to Docker Hub with the user-facing release
tags:

| Build job | Tags pushed |
|---|---|
| `build-variant-base` | `vX.Y.Z`, `latest` |
| `build-variant-omos` | `vX.Y.Z-omos`, `latest-omos` |
| `build-variant-with-pi` | `vX.Y.Z-with-pi`, `latest-with-pi` |
| `build-variant-omos-with-pi` | `vX.Y.Z-omos-with-pi`, `latest-omos-with-pi` |

The `latest*` aliases are only updated when `promote_latest=true` (the
manual dispatch input) — for test runs, `promote_latest=false` keeps the
production aliases pointing at the previous good release.
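The tagging rule in the table, combined with the `promote_latest` switch, can be expressed as a small function. This is a sketch of the naming scheme only; `release_tags` is a hypothetical name and the workflow may compute the tags differently.

```sh
#!/bin/sh
# release_tags is a hypothetical helper illustrating the tag naming scheme.
# $1 = release version, $2 = variant, $3 = promote_latest (true/false)
release_tags() {
  version=$1 variant=$2 promote_latest=$3
  # The base variant carries no suffix; the others append "-<variant>".
  if [ "$variant" = "base" ]; then suffix=""; else suffix="-$variant"; fi
  tags="${version}${suffix}"
  # The latest* alias is only added when promotion is requested.
  if [ "$promote_latest" = "true" ]; then
    tags="$tags latest${suffix}"
  fi
  echo "$tags"
}

release_tags v1.14.50 omos true
release_tags v0.0.0-split-test base false
```

The second call models a test dispatch: only the versioned tag is produced, so the `latest*` aliases stay untouched.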

### Step 5: `promote-base-latest`

Once all four variants successfully publish, re-tag `base-<hash>` as
`base-latest` using `crane copy`. This is a **manifest-level re-tag, not
a rebuild** — it touches only Docker Hub's image index, takes seconds,
and is atomic.

The reason this happens *after* variants succeed (rather than alongside
`build-base`) is so a partial failure leaves `base-latest` pointing at
the previous known-good base. External consumers who pin to
`base-latest` (e.g. the planned pi-devbox repo) never see a broken base.
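The ordering guarantee above amounts to "promote only if every variant job succeeded". The sketch below models that as a dry run: it prints the `crane copy` command instead of executing it. In the real workflow the gating is presumably expressed through job dependencies; the `promote_base_latest` function and its result arguments are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch only; promote_base_latest is a hypothetical helper.
IMAGE="joakimp/opencode-devbox"

# $1 = 12-character base hash, remaining args = result of each variant job
promote_base_latest() {
  hash=$1; shift
  for result in "$@"; do
    if [ "$result" != "success" ]; then
      echo "skip promotion: a variant job reported '$result'" >&2
      return 1
    fi
  done
  # Dry run: print the re-tag command rather than running it.
  echo "crane copy ${IMAGE}:base-${hash} ${IMAGE}:base-latest"
}

promote_base_latest 0123456789ab success success success success
```

With any non-`success` result the function returns early and prints nothing to stdout, which mirrors `base-latest` staying on the previous known-good base.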

### Step 6: `update-description`

Push the generated `DOCKER_HUB.md` to the Hub repo's `full_description`
field via the Hub REST API. The step is unchanged from the original
`docker-publish.yml` pipeline.

## NPM_CONFIG_PREFIX gotcha (variant override pattern)

The base sets

```dockerfile
ENV NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global
```

This is intentional — it makes `pi install npm:<pkg>` and `npm install -g`
land on the `devbox-pi-config` named volume at runtime, so user-installed
packages survive container recreate AND image rebuild.

But the *variant build* inherits this prefix at build time. If left as-is,
`npm install -g opencode-ai@$VERSION` in `Dockerfile.variant` would
install opencode into `/home/developer/.pi/npm-global/...`, which is then
**shadowed by the volume mount at runtime** → opencode disappears from
PATH on first start.

Fix: each `npm install -g` in `Dockerfile.variant` overrides the prefix
per-RUN:

```dockerfile
RUN NPM_CONFIG_PREFIX=/usr npm install -g opencode-ai@${OPENCODE_VERSION}
```

Baked binaries land in `/usr/bin/...` (the system prefix) and so survive the
volume mount. Runtime-installed user packages still land in
`~/.pi/npm-global/...`. Both are visible on PATH.
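The per-RUN override relies on ordinary shell semantics: a `VAR=value` prefix applies to that one command only and leaves the inherited environment alone. The snippet below demonstrates the scoping without needing npm; the `export` line stands in for the `ENV` inherited from the base image.

```sh
#!/bin/sh
# Stand-in for the ENV line inherited from the base image.
export NPM_CONFIG_PREFIX=/home/developer/.pi/npm-global

# A VAR=value prefix affects only the command it is attached to.
during=$(NPM_CONFIG_PREFIX=/usr sh -c 'echo "$NPM_CONFIG_PREFIX"')
after=$(sh -c 'echo "$NPM_CONFIG_PREFIX"')

echo "during the overridden command: $during"
echo "for later commands:            $after"
```

The first line reports `/usr` and the second the original prefix, which is why the override has to be repeated on every `npm install -g` in `Dockerfile.variant` rather than set once.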

## Cache strategy

Two registry caches are configured:

```yaml
cache-from: type=registry,ref=joakimp/opencode-devbox:base-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-buildcache,mode=max

cache-from: type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache
cache-to:   type=registry,ref=joakimp/opencode-devbox:base-variant-buildcache,mode=max
```

`mode=max` exports cache for *all* layers, including those of intermediate
build steps, not just the layers of the final image. This matters for the
multi-arch builds, where every cache miss on the emulated arm64 side is costly.
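For debugging outside CI, the base-cache settings above translate to `docker buildx build` flags roughly as follows. This is a dry-run sketch that only prints the command; the `buildx_base_cmd` helper is hypothetical, and the exact options the workflow passes through `docker/build-push-action` may differ.

```sh
#!/bin/sh
# Sketch only: prints a buildx command line, does not run it.
IMAGE="joakimp/opencode-devbox"

# $1 = 12-character base hash
buildx_base_cmd() {
  echo docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --file Dockerfile.base \
    --cache-from "type=registry,ref=${IMAGE}:base-buildcache" \
    --cache-to "type=registry,ref=${IMAGE}:base-buildcache,mode=max" \
    --tag "${IMAGE}:base-$1" \
    --push .
}

buildx_base_cmd 0123456789ab
```

Dropping `--push` and the `--cache-to` flag gives a read-only variant that consumes the shared cache without writing to the registry.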

## Wall-clock estimates

| Scenario | Original pipeline (`docker-publish.yml`) | Split-base pipeline |
|---|---|---|
| Version-bump-only release (only opencode/pi/omos version changed) | ~165–180 min | **~30–40 min** (base cache hit) |
| Base-touching release (apt/Node/Debian/entrypoint change) | ~165–180 min | **~70–90 min** (base rebuilds) |

The split-base pipeline is at its slowest on base-touching releases, which are
infrequent (a few times a year, for Debian or Node major version bumps).
Most releases are version bumps and ride the cache.
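To put the two columns of the table in relative terms, the reduction can be computed from the range endpoints. The sketch pairs the slowest new time with the fastest old time for a conservative figure, and the reverse for an optimistic one; `reduction_pct` is a hypothetical helper.

```sh
#!/bin/sh
# reduction_pct OLD_MINUTES NEW_MINUTES -> percentage reduction, rounded.
reduction_pct() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (1 - new / old) * 100 }'
}

echo "version-bump,  conservative: $(reduction_pct 165 40)%"
echo "version-bump,  optimistic:   $(reduction_pct 180 30)%"
echo "base-touching, conservative: $(reduction_pct 165 90)%"
echo "base-touching, optimistic:   $(reduction_pct 180 70)%"
```

Using the table's own estimates, this works out to roughly 76–83% for version-bump releases and 45–61% for base-touching ones.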

## Validate workflow

[`validate.yml`](workflows/validate.yml) is the lightweight gate that runs
on every push to `main` and on PRs. It:

1. Runs `scripts/generate-dockerhub-md.py --check` to enforce
   `DOCKER_HUB.md` is in sync with `HUB_TEMPLATE`.
2. Builds each of the four variants amd64-only (no multi-arch, no push)
   and runs `scripts/smoke-test.sh`.

This catches regressions before they reach a tag push. Wall clock ~30 min.

## Runner expectations

- **Image:** `catthehacker/ubuntu:act-latest`. Each job runs inside a
  fresh container of this image. Don't assume any pre-installed
  toolchains beyond what catthehacker ships.
- **Disk pressure:** the runner host has ~40 GB of usable overlay space,
  often 70%+ used at job start. Every job that does `load: true` (smoke)
  starts with a `Reclaim runner disk` step that strips
  catthehacker-resident toolchains (Android SDK, .NET, Swift, GHC, JVM,
  Boost, Chromium, PowerShell) and prunes stale docker state. Don't
  remove these steps without testing on a fresh runner.
- **Concurrency:** 2 runners. Jobs in the same workflow run can fan out to
  both; jobs in *different* workflow runs are serialized by Gitea's queue.
  The `concurrency: { group: ${{ workflow }}-${{ ref }}, cancel-in-progress: false }`
  setting keeps tag pushes from racing each other but allows
  per-PR/per-branch parallelism.
- **Workflow visibility in UI:** Gitea Actions only surfaces workflows
  from the **default branch** in the web UI's workflow list, even for
  `workflow_dispatch` triggers. Workflows on feature branches are
  invisible until merged to `main`.
- **Artifact actions quirk:** `actions/{upload,download}-artifact@v4+` does
  not work on Gitea (it depends on a GitHub-only Artifact API). Stick to
  `@v3` if matrix-fanout-with-artifacts is ever needed. We avoided this
  by using `docker/build-push-action@v7` with comma-separated
  `platforms: linux/amd64,linux/arm64`, which does a multi-arch push
  natively in a single job with no artifact dance.

## Migration plan: split-base → production

1. **Validate the split-base dispatch.** Trigger
   `docker-publish-split.yml` manually with `release_tag=v0.0.0-split-test`
   and `promote_latest=false`. Confirm all jobs go green, image sizes
   match the production baseline within ~10%, and no unexpected layer
   rebuilds appear in `build-variant-*` logs after the FROM line.
2. **Run a second dispatch** to confirm cache-hit behavior:
   `base-decide` should set `need_build=false`, `build-base` should be
   skipped entirely, total wall clock should drop to ~25–40 min.
3. **Cut over** — *done as of v1.14.50.* `docker-publish-split.yml` now
   triggers on `push: tags: v*`. The old `docker-publish.yml` and the
   original `Dockerfile` were deleted.
4. **Tag a release.** First production release on the new pipeline.

## Related docs

- [`AGENTS.md`](../AGENTS.md) — domain facts, release-day checklist,
  documentation coupling rules. Read first when modifying CI behavior.
- [`CHANGELOG.md`](../CHANGELOG.md) — build pipeline rewrite landed in v1.14.50.
- `Dockerfile.base`, `Dockerfile.variant` — the split-base Dockerfiles.
  Comments at the top of each explain their role.
- [`scripts/smoke-test.sh`](../scripts/smoke-test.sh) — invoked by all
  workflows in this directory; this is the single source of truth for "what does a
  built image have to satisfy".
- [`scripts/generate-dockerhub-md.py`](../scripts/generate-dockerhub-md.py)
  — generates `DOCKER_HUB.md` from `HUB_TEMPLATE`. `--check` enforces
  sync in `validate.yml`.