catthehacker/ubuntu:act-latest ships Node/npm under /opt/acttoolcache/
with PATH updated only in /etc/environment. act_runner (nektos/act) does
not source /etc/environment — it reads the Docker image's ENV instructions
(inspectResult.Config.Env) which only contain DEBIAN_FRONTEND=noninteractive.
So npm is NOT on PATH and 'npm view ...' would have CI-failed on first run.
Fix: query the npm registry HTTP API directly with curl+jq, both of which
are already used extensively by this workflow (curl for Hub auth/manifest
inspect, jq for token parsing). The endpoint
https://registry.npmjs.org/<pkg>/latest
returns JSON with a 'version' field — equivalent to 'npm view <pkg> version'
but with no toolchain dependency.
Verified locally: both URLs resolve correctly to 0.75.5 (pi) and 1.1.1 (omos).
Evidence: nektos/act pkg/container/docker_run.go reads imageEnv from
inspectResult.Config.Env, not /etc/environment. DefaultPathVariable() in
linux_container_environment_extensions.go returns a hardcoded path with no
/opt/acttoolcache in it.
Mirrors the pi-devbox v0.75.5b fix (2026-05-23) onto the four-variant
pipeline here. The with-pi, omos, and omos-with-pi variants install
upstream npm packages whose *_VERSION build-args defaulted to 'latest'.
When the build-arg string is byte-identical across builds, the layer
hash is identical and the registry buildcache silently reuses the layer
from whatever upstream version was current when the cache was first
populated — same mechanism that shipped pi-devbox v0.74.0..v0.75.5 with
identical image bytes.
Currently masked here because OPENCODE_VERSION is a hard-coded ARG that
bumps every release; parent-chain cache invalidation flushes the
downstream pi/omos layers. Masking would fail on any vN.N.Nb opencode-
version-unchanged release that only bumps pi or omos. Filed last night
as parked followup; fixing preventatively now that #5 (AWS SSO inside
tor-ms22 container) cleared.
CHANGES
.gitea/workflows/docker-publish-split.yml — new resolve-versions job
running 'npm view @earendil-works/pi-coding-agent version' and
'npm view oh-my-opencode-slim version', exposing concrete strings as
job outputs. All six affected jobs (smoke-omos, smoke-with-pi,
smoke-omos-with-pi, build-variant-omos, build-variant-with-pi,
build-variant-omos-with-pi) now consume them as PI_VERSION /
OMOS_VERSION build-args. smoke-base / build-variant-base unaffected.
scripts/smoke-test.sh — new run_expect helper asserting an expected
substring in command output. The pi check uses EXPECTED_PI_VERSION;
the omos check uses EXPECTED_OMOS_VERSION against npm ls -g. Both env
vars are wired from resolve-versions outputs in the smoke jobs. Catches
this regression class on the next release, not four releases later.
Dockerfile.variant — comment blocks above OPENCODE_VERSION (source-
pinned, not subject to the bug), PI_VERSION (CI-resolved), and
OMOS_VERSION (CI-resolved) explaining the cache-hit footgun.
AGENTS.md — new convention bullet under 'Critical conventions' naming
the resolve-versions job + EXPECTED_*_VERSION wiring as the contract
to keep in lockstep when modifying variant build-args.
.gitea/README.md — Step 1 expanded to cover the parallel resolve-
versions job alongside base-decide; pipeline diagram updated.
CHANGELOG.md — Unreleased entry describing the fix, masking mechanism,
and audit footprint.
No image-content change expected on the next release vs what 'latest'
would have resolved to anyway. Purely makes the cache invalidate
correctly going forward.
Defensive against local-vs-CI hash divergence. `find rootfs -type f`
includes gitignored junk like rootfs/__pycache__/*.pyc and macOS
.DS_Store/._AppleDouble files, which CI's clean checkout never sees.
This bit us during v1.15.4 debugging when a stale generate-config.cpython-314.pyc
on the local rootfs/ produced base-3605aa6b6ab1 while CI computed
base-35ee5fe7861a. Took meaningful time to track down because git status
doesn't surface gitignored files.
Verified: same filter applied to current clean tree still produces
35ee5fe7861a (the published v1.15.4b base digest).
Recovery for v1.15.4's partial publish (omos-with-pi exceeded 3500 MB
smoke threshold; other 3 variants published cleanly). Two changes:
1. omos-with-pi threshold 3500 -> 3700 MB. Compounded growth from
opencode 1.15.0 -> 1.15.4 (4 patch versions) plus pi 0.74.0 -> 0.75.3
(minor + 3 patches) summed in the omos-with-pi variant, just over
the existing limit. Same pattern as prior threshold bumps (v1.14.31c,
v1.15.0b). Restores ~150 MB headroom for routine apt-upgrade drift.
2. update-description workflow bug fix. Pre-existing latent bug exposed
by v1.15.4's partial publish: update-description.needs includes all 4
build-variant-* jobs, and gitea Actions' default behavior is
'skipped need => skip dependent' \u2014 even when the job's own if:
condition is satisfied. So when build-variant-omos-with-pi was
skipped (because its smoke failed), update-description cascaded into
a skip too, and Hub description didn't refresh on v1.15.4 despite
3 variants publishing.
Fix: wrap if: in always() + explicit success check on the base
variant. Same fix applied to promote-base-latest preemptively (it
has the same latent bug, currently masked by the cache-hit gate).
No image-side changes \u2014 cache hit on base-35ee5fe7861a.
Two workflow-only changes for promote-base-latest, no image-side impact:
T14 \u2014 replace imjasonh/setup-crane@v0.4 with direct pinned crane install.
The action's bootstrap script calls api.github.com/.../releases/latest
at every run to discover the crane version. That call periodically
rate-limits and returns JSON without .tag_name, jq emits 'null', the
action then downloads .../releases/download/null/... \u2192 404 \u2192 'gzip:
unexpected end of file' \u2192 exit 2. We hit this on the v1.15.3 release
(2026-05-16) where it was cosmetic only \u2014 base-latest was already
correct from cache hit \u2014 but the red-X is annoying.
Replaced with curl + tar pinned to crane v0.21.6 (latest at time of
change). Same pattern as other GitHub-sourced binaries in the
Dockerfile layer (gosu, fzf, eza etc.); operator bumps CRANE_VERSION
deliberately when wanting updates.
T15 \u2014 gate promote-base-latest on need_build == 'true'. When the base
layer's content hash hasn't changed (cache hit on existing base-<hash>
from a prior run), base-latest already points at the correct digest.
The retag is a tautology, and any transient failure of it produces a
red-X for an operation that didn't need to happen. Skipping the job
entirely on cache-hit is correct and removes a whole class of cosmetic
failure. Manual workflow_dispatch with promote_latest=true still bypasses
the gate as an escape hatch (e.g., if base-latest got hand-deleted and
needs regeneration without rebuilding the base).
This will not trigger a CI publish run (main-branch commit, no tag).
Gitea Actions evaluates 'env.PROMOTE_LATEST' as empty in YAML 'if:'
contexts even though the same env var substitutes correctly in
shell run: blocks. Result: on v1.15.0/v1.15.0b tag pushes, the
build-variant-* jobs correctly pushed latest-* aliases (shell context),
but promote-base-latest and update-description got skipped (YAML
context), so the Hub README description wasn't refreshed.
Switch to evaluating github.ref_type directly in the if-conditions —
matches the production-trigger semantics and avoids the env-var
indirection that gitea evaluates inconsistently.
The previous two-step approach (build Dockerfile.base \ then
Dockerfile.variant FROM the local image) doesn't work: each
docker/build-push-action@v7 invocation runs in its own buildx
container context, and an image loaded into the host docker daemon
by step N is not visible to step N+1's buildx invocation.
Variant builds in validate.yml now FROM joakimp/opencode-devbox:base-latest
on Docker Hub, matching the production smokes' parent. Trade-off:
PRs/pushes that change Dockerfile.base, rootfs/, or entrypoint*.sh
are not exercised here \u2014 only release tags rebuild the base via
docker-publish-split.yml.
The new base-change-warning job surfaces a runtime warning when a
commit modifies any base-image input, telling the author to run a
workflow_dispatch test if they want full validation before merging.
Replace single-Dockerfile build with two-step: build Dockerfile.base
first (loads as opencode-devbox:validate-base), then build
Dockerfile.variant with BASE_IMAGE pointing at the local base image.
All four validate jobs updated.
- Bump OPENCODE_VERSION 1.14.44 -> 1.14.50 in Dockerfile.variant
- Cut over: docker-publish-split.yml now triggers on push: tags: v*
(was workflow_dispatch only). RELEASE_TAG and PROMOTE_LATEST derived
from github.ref_type/ref_name for tag-push; inputs still available
for manual workflow_dispatch runs.
- Delete docker-publish.yml (retired, replaced by split-base pipeline)
- Delete Dockerfile (retired, replaced by Dockerfile.base + Dockerfile.variant)
- Update CHANGELOG: promote Unreleased -> v1.14.50
- Update AGENTS.md, .gitea/README.md, validate.yml: remove all references
to the old single-Dockerfile pipeline and WIP migration plan
echo -e doesn't interpret \n in /bin/sh (dash), which is the default
shell in catthehacker/ubuntu:act-latest. This caused steps.tags.outputs.tags
to be empty, resulting in 'tag is needed when pushing to registry' from buildx.
Also fixes a secondary bug: TAGS='${TAGS}\n...' stored a literal backslash-n
rather than a real newline, which would have broken multi-tag output when
promote_latest=true.
Fix: replace with a brace block using plain echo, which produces actual newlines
and works in both sh and bash.
Two-Dockerfile split-base build alongside the existing single-Dockerfile
pipeline. Goal: cut CI wall clock from ~165-180min to ~30-40min on
typical version-bump-only releases by reusing a base image across the
four variants.
Files added:
- Dockerfile.base variant-independent layers (apt, locales, AWS
CLI, Node.js, mempalace, gitea-mcp, user setup,
chromadb prewarm, ENVs, entrypoints).
- Dockerfile.variant FROMs ${BASE_IMAGE} and adds opencode / pi /
omos / Go installs gated by INSTALL_* args.
Each npm install -g uses NPM_CONFIG_PREFIX=/usr
per-RUN to keep baked binaries off the volume-
shadowed ~/.pi/npm-global path inherited from
base.
- .gitea/workflows/docker-publish-split.yml
workflow_dispatch-only pipeline:
base-decide -> build-base (conditional) ->
smoke-* (4 parallel) -> build-variant-*
(4 parallel) -> promote-base-latest ->
update-description. Hash-driven base reuse:
if base-<sha> already exists on Docker Hub,
the build is skipped entirely. Inputs:
release_tag (test tag suffix, default
v0.0.0-split-test) and promote_latest
(default false; gates latest-* aliases and
Hub description update).
Files unchanged:
- Dockerfile, docker-publish.yml, validate.yml all left in place so
the production tag-push pipeline keeps working untouched.
Migration plan (in CHANGELOG Unreleased):
1. workflow_dispatch test run with promote_latest=false; verify the
four variant images smoke-pass and have plausible sizes.
2. Compare manifest digests against the same-version output from the
production pipeline (independent test run on the same commit).
3. Once verified across 1-2 release cycles, swap docker-publish-split.yml
to on: push: tags: v* and retire docker-publish.yml.
AGENTS.md and CHANGELOG.md updated with file roles and the migration
plan. Production pipeline behavior is bit-for-bit unchanged on this
branch.
.gitea/workflows/validate.yml:
Adds validate-with-pi (INSTALL_PI=true) and validate-omos-with-pi
(INSTALL_OMOS=true + INSTALL_PI=true). amd64 single-arch with smoke
test, no push.
.gitea/workflows/docker-publish.yml:
Adds smoke-with-pi → build-with-pi and smoke-omos-with-pi →
build-omos-with-pi job pairs. Each push-by-digest multi-arch
(amd64+arm64) to Docker Hub with two tags:
${VERSION}-with-pi + latest-with-pi
${VERSION}-omos-with-pi + latest-omos-with-pi
update-description.needs[] extended to wait on both new build jobs.
scripts/smoke-test.sh:
bun-presence check now treats omos and omos-with-pi as the bun
variants. Pi state assertions wait up to 30s for entrypoint-user.sh
to finish deploying pi-toolkit + extensions (omos-with-pi has more
setup work than the base+pi path; the previous sleep-1 was too short
and caused empty-error assertion failures on cold starts).
Local verification (arm64 via OrbStack):
base → 1871 MB, all checks PASS
omos → 2813 MB, all checks PASS
with-pi → 2277 MB, all checks PASS
omos-with-pi → 3030 MB, all checks PASS
CI now produces 8 Docker Hub tags per release:
vX.Y.Z[n], latest
vX.Y.Z[n]-omos, latest-omos
vX.Y.Z[n]-with-pi, latest-with-pi
vX.Y.Z[n]-omos-with-pi, latest-omos-with-pi
v1.14.31c's matrix jobs failed on Upload digest with GHESNotSupportedError
— Gitea Actions doesn't support actions/upload-artifact@v4+.
Separately, build-omos arm64 hung silently for 12 min in Set-up job,
likely catthehacker pull contention between concurrent matrix children.
Rather than downgrade artifacts to @v3, collapse the matrix entirely.
docker/build-push-action@v7 with platforms: linux/amd64,linux/arm64
publishes a proper multi-arch manifest in one job, so the
artifact-passing and imagetools create merge dance only existed to
support a matrix split we no longer need.
The matrix was designed around load: true disk exhaustion (v1.14.30b),
but push-by-digest streams straight to the registry with fundamentally
different disk profile. Reclaim step gives enough headroom for the
combined amd64+arm64 push case.
Workflow: 7 jobs → 5. docker-publish.yml: 263 → ~110 lines of YAML.
Also:
- timeout-minutes: 90 on build jobs so hung builds fail explicitly
- BUILDKIT_PROGRESS=plain at workflow level for line-by-line arm64 logs
- AGENTS.md §CI quirks documents the Gitea-specific traps
(upload-artifact@v3-only, dash-not-bash, build-push-action@v7
multi-arch convention, reclaim requirement)
v1.14.31b made it through smoke-base and validate-base (reclaim worked),
but two narrow bugs blocked the rest:
1. 'Derive platform slug' in the per-arch matrix jobs used bash
${PLATFORM_PAIR//\//-} which dash (/bin/sh in the runner) can't
parse — 'Bad substitution'. Rewrote with 'tr / -'.
2. smoke-omos image size 3107 MB tripped the 3000 MB guardrail. All
functional checks pass; the mempalace-toolkit bake-in from v1.14.30b
added ~100 MB and the threshold was stale. Bumped to 3200 MB.
No image-level changes.
v1.14.31 publish and validate both hit 'No space left on device' on
single-arch amd64 smoke/validate builds. The image has crossed ~3 GB
and the runner's ~40 GB overlay starts ~70% full, so 'load: true'
peak disk (tarball + unpacked image + buildx cache) no longer fits.
Add a 'Reclaim runner disk' step to validate-base, validate-omos,
smoke-base, smoke-omos. Strips catthehacker-resident toolchains we
never use (hosted-tool-cache, dotnet, android, powershell, swift,
ghc, jvm, microsoft, chromium, boost), then runs 'docker system
prune -af --volumes' + 'docker builder prune -af' against the
runner's dockerd before setup-buildx-action. Expected reclaim is
6-12 GB depending on what's resident.
Deliberately NOT in the per-arch matrix build jobs — push-by-digest
doesn't need it and pruning in parallel jobs risks one job nuking
another's in-flight buildx cache. Also add workflow-level
concurrency on docker-publish.yml so concurrent tag pushes serialize
cleanly.
The v1.14.30b publish failed on both variants with 'No space left on
device' — arm64 QEMU-emulated layers were stored alongside amd64 on the
same ~40 GB runner, and the mempalace-toolkit bake-in from v1.14.30b
tipped peak disk over the edge during the nodejs dpkg unpack and the
git-lfs layer export.
Refactor docker-publish.yml to the canonical push-by-digest +
manifest-merge pattern: smoke test (amd64) runs on its own runner, each
(variant x arch) push target runs on its own fresh runner with
outputs=type=image,push-by-digest=true,push=true (no local image
store), then a tiny merge job assembles the multi-arch manifest with
docker buildx imagetools create from digest artifacts. Per-runner disk
peak is roughly one-quarter of the old single-job peak. The four
Docker Hub tags per release are unchanged. As a bonus, amd64 and arm64
now build in parallel.
No image-level changes beyond the opencode bump.
Main changes:
- Extract opencode.json generation from entrypoint-user.sh into a
standalone Python script (rootfs/usr/local/lib/opencode-devbox/
generate-config.py). Preserves the never-overwrite-existing-config
guarantee. Cuts entrypoint-user.sh from 176 to 97 lines.
- Install MemPalace via 'uv tool install' into an isolated venv at
/opt/uv-tools/mempalace/ with a /usr/local/bin/mempalace-mcp-server
wrapper, replacing the 'pip install --break-system-packages' escape
hatch. The wrapper is what generate-config.py references in the
auto-generated opencode.json. Also fix 'mempalace init' in
entrypoint-user.sh to use --yes so first-start initialization isn't
interactive (this used to hang or print prompts into the user's
terminal). Gated by INSTALL_MEMPALACE build arg (default true) so
users who don't need AI memory can shave ~300 MB.
- Sentinel-file pattern in entrypoint.sh volume-ownership loop: write
.devbox-owner after a successful chown -R, skip the recursive walk
on subsequent starts when the sentinel matches FINAL_UID:FINAL_GID.
Cuts multi-second startup costs to milliseconds on large volumes
(nvim plugins, palace data). UID changes still trigger a full chown.
- Float all GitHub/Gitea-hosted binary versions: gosu, fzf, git-lfs,
neovim, bat, eza, zoxide, uv, gitea-mcp now default to 'latest' and
resolve the newest upstream release at build time via the /releases/
latest redirect. Go (go.dev JSON feed) and oh-my-opencode-slim (npm
@latest) likewise. Intentional pins still in place: OPENCODE_VERSION,
NODE_VERSION=22, DEBIAN_VERSION=trixie-slim. Each *_VERSION ARG
accepts an explicit value to lock a specific version when needed.
- New scripts/smoke-test.sh verifies binary presence, opencode startup,
entrypoint user drop, generate-config idempotency, bun's presence-
per-variant, and image size against thresholds (2500 MB base, 3000
MB OMOS). Prints resolved component versions as its first step so
CI logs always record what got baked into a given image.
- New .gitea/workflows/validate.yml runs on push to main and PRs:
single-arch amd64 build, smoke test, DOCKER_HUB.md sync check. Tag-
triggered docker-publish.yml now smoke-tests each variant on amd64
before the full multi-arch push.
- scripts/generate-dockerhub-md.py auto-generates DOCKER_HUB.md from
README.md using explicit SECTION_RULES. --check mode fails CI when
the committed file is out of sync. Enforces the 25 kB Docker Hub
limit. Adding a new README section forces an explicit keep/drop/
replace decision.
- Remove dead INSTALL_PYTHON build arg (was a no-op since mempalace
added python3 unconditionally).