From 6cc2670a9381eabc30219d759661f7ac30bb8916 Mon Sep 17 00:00:00 2001
From: pi
Date: Thu, 28 May 2026 16:21:40 +0200
Subject: [PATCH] docs: manual host-publish runbook + cache-export gotcha in AGENTS.md

Captures the escape-hatch procedure used to ship v1.15.12 on 2026-05-28
when buildkit cache-export mode=max started returning HTTP 400 from the
Hub CDN, breaking five consecutive CI publishes (runs #332/333/334/336 +
a rerun).

- docs/manual-host-publish.sh: the literal script that shipped v1.15.12
  from a developer Mac via Orbstack, preserved as-is for future
  reference.
- docs/manual-host-publish.md: runbook explaining when to reach for it,
  the four constants to edit, three ways to source BASE_HASH (CI log /
  Hub probe / local recompute matching base-decide's exact recipe
  including __pycache__/.DS_Store junk filters), and adaptations for
  pi-devbox / letter-suffix rebuilds / partial-failure recovery.
- AGENTS.md: new Critical conventions bullet documenting the
  cache-from/cache-to disablement, failure shape, repo-specificity, why
  action pinning didn't help, the trade-off, and the re-enable
  condition. Cross-references CHANGELOG v1.15.12 Unreleased + the new
  runbook.
---
 AGENTS.md                   |   1 +
 docs/manual-host-publish.md | 127 ++++++++++++++++++++++++++++++++++++
 docs/manual-host-publish.sh | 117 +++++++++++++++++++++++++++++++++
 3 files changed, 245 insertions(+)
 create mode 100644 docs/manual-host-publish.md
 create mode 100755 docs/manual-host-publish.sh

diff --git a/AGENTS.md b/AGENTS.md
index 38f56cf..ea8e2c6 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -73,6 +73,7 @@ cd /tmp && npm pack @earendil-works/pi-coding-agent@0.75.5 && tar -xzf earendil-
 - **GitHub/Gitea-sourced binaries float by default** — gosu, fzf, git-lfs, gitleaks, nvim, bat, eza, zoxide, uv, gitea-mcp, Go, oh-my-opencode-slim all default to `latest`. Each build-time install step reads the `/releases/latest` Location redirect (or the go.dev JSON feed for Go) and derives the concrete version. Use the same `ARCH` case-switch pattern for multi-arch support (amd64/arm64) — mind project-specific arch-name deviations (gitleaks uses `x64`, bat/eza/zoxide use `x86_64`/`aarch64`, gosu uses `amd64`/`arm64`). Intentional pins: `OPENCODE_VERSION` (drives the image tag), `NODE_VERSION=22` (major pin), `DEBIAN_VERSION=trixie-slim` (OS base). Adding a new upstream tool: follow the existing floated-version pattern, don't hardcode a specific tag.
 - **Resolved versions are logged by the smoke test** — `scripts/smoke-test.sh` prints a "Resolved component versions" table as its first step. CI logs always capture what got baked into a given image even when ARGs default to `latest`.
 - **`PI_VERSION` and `OMOS_VERSION` MUST be passed by CI as concrete versions**, not left at the `latest` default. The npm install steps in `Dockerfile.variant` (`npm install -g @earendil-works/pi-coding-agent` / `oh-my-opencode-slim@${OMOS_VERSION}`) produce identical layer-hashes when the ARG values are byte-identical across builds; combined with the registry buildcache (`base-buildcache`) the layer gets reused even when `latest` would have resolved to a newer upstream. This is the same class of bug that bit pi-devbox v0.74.0 → v0.75.5 (silent same-bytes-across-releases regression discovered 2026-05-23, fixed in pi-devbox v0.75.5b). It is currently *masked* in opencode-devbox by `OPENCODE_VERSION` being a hard-coded ARG that bumps every release — that bump invalidates the parent-chain cache key for the downstream pi/omos layers — but the masking would fail the moment a `vN.N.Nb` opencode-version-unchanged release ships that only bumps pi or omos. Preventative fix: `.gitea/workflows/docker-publish-split.yml` has a `resolve-versions` job that runs `npm view @earendil-works/pi-coding-agent version` and `npm view oh-my-opencode-slim version`, exposing concrete values as outputs that every variant smoke + build job consumes via build-args. Smoke tests assert via `EXPECTED_PI_VERSION` / `EXPECTED_OMOS_VERSION` env vars — would catch the regression on the next release rather than four releases later. **If you change the variant build-args list, the resolve-versions job, or the smoke EXPECTED_*_VERSION wiring, audit all affected jobs in lockstep.**
+- **Registry buildkit cache-export is currently disabled** — do NOT re-add `cache-from`/`cache-to` to the `build-base` step in `.gitea/workflows/docker-publish-split.yml` without first verifying that buildkit's `mode=max` cache-export to `registry-1.docker.io` no longer returns HTTP 400 from the Hub CDN edge. The regression surfaced ~2026-05-23 and broke five consecutive opencode-devbox publish attempts (runs #332/333/334/336 + a rerun); root-caused on 2026-05-28 by a manual host-side publish that reproduced the same 400 only on `--cache-to` while image push worked fine. Failure shape is stable (`Offset:0` in the `_state` token, HTML response body = CDN-tier rejection, not registry backend), repo-specific (we're the only repo writing `:base-buildcache` mode=max), and explains why pinning `setup-buildx-action@v4.0.0` didn't help (action pin doesn't change the bundled buildkit version on the catthehacker runner image). Trade-off: `Dockerfile.base` changes pay a full ~3 min rebuild instead of pulling cached layers; unchanged bases short-circuit at the Hub-probe step in `base-decide` and never re-build anyway. Variants don't use registry cache so they're unaffected. Re-enable condition: upstream moby/buildkit fix lands AND a low-risk test run succeeds without 400s. See CHANGELOG v1.15.12 `Unreleased` block for the full diagnostic chain. Manual escape-hatch publish procedure: `docs/manual-host-publish.md`.
 - **Shell scripts use `set -euo pipefail`** — both entrypoints are strict. Errors in volume chown or SSH permission operations are intentionally suppressed with `|| true`.
 - **MemPalace install path** — installed via `uv tool install` into `/opt/uv-tools/mempalace/`. Both the `mempalace` CLI and the `mempalace-mcp` MCP server binary are shipped as entry points by the mempalace package itself and placed on PATH by uv as shims whose shebangs point at the venv's Python. No hand-rolled wrapper is needed. Do not use `pip install --break-system-packages` — that was the previous approach and has been removed. Do not use `["python3", "-m", "mempalace.mcp_server"]` in `opencode.jsonc` — system Python can't import from the uv venv.
 - **generate-config.py idempotency** — the script MUST never overwrite an existing `opencode.jsonc` or legacy `opencode.json`. Config persists in the `devbox-opencode-config` named volume; accidentally clobbering that file would destroy hand-edits. The smoke test asserts this.
diff --git a/docs/manual-host-publish.md b/docs/manual-host-publish.md
new file mode 100644
index 0000000..15a7c7f
--- /dev/null
+++ b/docs/manual-host-publish.md
@@ -0,0 +1,127 @@
+# Manual host-side publish — escape hatch when CI is broken
+
+This runbook is the procedure for publishing an opencode-devbox release **directly from a developer host** when the Gitea Actions → Docker Hub path is broken. It was used in anger on 2026-05-28 to ship `v1.15.12` after five consecutive CI publish failures (runs #332/333/334/336 + a rerun), and doubled as a parallel diagnostic that pinpointed the root cause (buildkit `cache-export mode=max` returning HTTP 400 from the Hub CDN).
+
+The procedure is also a **diagnostic probe**. If the host-side publish succeeds where CI fails, the failure is somewhere in the runner → Hub path (cache-export, runner egress, runner-image, action versions). If host-side fails the same way, the failure is in your local buildx + Hub combination and you need a different escape (different network, different account, file an upstream issue).
+
+## When to reach for this
+
+- Tag pushed, CI keeps failing on `docker buildx build --push`, the failure shape is stable across reruns.
+- Failure body looks like a registry-tier rejection (HTTP 4xx, HTML response body, repeats on every retry) — i.e. not a transient failure.
+- You've already disproved the obvious suspects (action pin, runner image, network) per the [`ci-release-watcher` skill](../../../.agents/skills/ci-release-watcher/SKILL.md) playbook.
+- You need the release **shipped today** and don't want to wait for a CI fix to land + re-trigger.
+
+If CI is broken because **a workflow change you just made is bad**, fix the workflow and re-tag with a letter suffix. This runbook is for when the workflow looks correct but the publish path itself is broken.
+
+## Prerequisites on the host
+
+- Docker (or Orbstack on macOS) with `docker buildx` available — multi-arch publish needs a `setup-qemu` equivalent. Orbstack ships QEMU emulators for both archs by default; on Linux install `qemu-user-static` and run `docker run --privileged --rm tonistiigi/binfmt --install all` once per host.
+- `docker login` credentials for `joakimp` on Docker Hub (PAT or password). Confirm with `docker info | grep Username`.
+- A clone of `opencode-devbox` checked out at the **exact tag** you want to publish. `git status` clean. `git describe --tags --exact-match HEAD` should print the tag.
+- Network connectivity to `registry-1.docker.io` from the host. Verify with `curl -sI https://registry-1.docker.io/v2/ | head -1` (expect `401 Unauthorized` — that's the v2 API saying "auth required", which means you can reach it).
+
+## How to use this runbook
+
+A working reference script lives next to this doc: **[`docs/manual-host-publish.sh`](manual-host-publish.sh)**. It is the literal script that shipped opencode-devbox v1.15.12 on 2026-05-28 from a developer Mac via Orbstack, with the BASE_HASH and version pins of that release. To publish a different release, **copy it to a new file, edit four constants at the top, and run it**:
+
+```bash
+cp docs/manual-host-publish.sh /tmp/manual-publish-vX.Y.Z.sh
+# Edit at top of file:
+#   RELEASE_TAG="vX.Y.Z"
+#   BASE_HASH="<12-char hash from CI's base-decide step>"
+#   PI_VERSION=""
+#   OMOS_VERSION=""
+bash /tmp/manual-publish-vX.Y.Z.sh
+```
+
+Keep the historical script in `docs/` as-is — it's an archive of the v1.15.12 publish, useful as a reference if a future debug needs to compare exact arg sets across releases. Don't edit it in place.
+
+The sections below explain what the script does and what you need to know to edit those four constants safely.
+
+## 1. Pin RELEASE_TAG
+
+The git tag you're publishing. Must match a tag in the local clone:
+
+```bash
+git fetch && git checkout v1.15.13 # whatever you're publishing
+git describe --tags --exact-match HEAD
+```
+
+The script asserts `HEAD == ${RELEASE_TAG}^{commit}` before doing anything destructive. If you've drifted, fix it with `git checkout` before running.
+
+## 2. Pin PI_VERSION and OMOS_VERSION
+
+Gitea CI's `resolve-versions` job queries the npm registry at workflow time and threads concrete versions through every variant build, mitigating the silent same-bytes-across-releases regression class documented in `AGENTS.md`. Do the same by hand:
+
+```bash
+curl -sf https://registry.npmjs.org/@earendil-works%2Fpi-coding-agent/latest | jq -r .version
+curl -sf https://registry.npmjs.org/oh-my-opencode-slim/latest | jq -r .version
+```
+
+Paste the two version strings into the script's `PI_VERSION` / `OMOS_VERSION` constants. Don't leave the script defaulting to `latest` — the registry buildcache will silently reuse a stale layer if the build-arg byte-equals a previous build.
+
+## 3. Pin BASE_HASH
+
+This is the 12-char hash that CI's `base-decide` job computes from `Dockerfile.base` + `rootfs/**` + `entrypoint*.sh`. Three ways to get it, in order of preference:
+
+**A. From a prior CI run on the same commit** (cheapest — if the Gitea Actions run that triggered on this tag got far enough to log `base-decide`'s output, just read it):
+
+```
+Gitea Actions → the run for vX.Y.Z → base-decide job → "Compute base tag" step → last line:
+  Computed base tag: base-XXXXXXXXXXXX
+```
+
+This is the canonical source. The whole reason for the manual escape is that *something later in CI broke* — `base-decide` itself is fast, deterministic, and almost always succeeds.
+
+**B. From an existing image on the Hub.** If a recent release already published a `base-` tag and the inputs haven't changed, you can copy that hash. Confirm with `docker manifest inspect joakimp/opencode-devbox:base-latest` and read the digest — if it matches a `base-` you already see on the Hub, that hash is yours.
+
+**C. Compute it locally**, replicating CI's exact recipe (the script in `.gitea/workflows/docker-publish-split.yml` `base-decide.compute`):
+
+```bash
+{
+  cat Dockerfile.base
+  find rootfs -type f \
+    ! -path '*/__pycache__/*' \
+    ! -name '*.pyc' \
+    ! -name '.DS_Store' \
+    ! -name '._*' \
+    -print0 2>/dev/null | sort -z | xargs -0 cat 2>/dev/null
+  cat entrypoint.sh entrypoint-user.sh
+} | sha256sum | cut -c1-12
+```
+
+The junk-file filters (`__pycache__`, `.DS_Store`, `._*` AppleDouble) matter — they are gitignored but `find -type f` picks them up locally and would diverge your hash from CI's clean checkout. Don't skip them.
+
+If method C disagrees with method A, **trust A** and find out why your local tree differs. The hash in CI is what's on the Hub; that's what variants must FROM.
+
+## What the script does (high level)
+
+After the constants are set, the script runs a 5-step procedure. No editing needed inside the body; the whole flow is parameterised by the four constants above plus `IMAGE` (which is fixed to `joakimp/opencode-devbox`).
+
+1. **Preflight** — buildx present, tag exists, `HEAD == tag`, multi-arch builder created if missing.
+2. **Base build (conditional)** — probe `${IMAGE}:base-${BASE_HASH}` on the Hub; if missing, build it multi-arch and push. **No `--cache-from` / `--cache-to`.** That's the whole point of this escape. If the base push itself fails the same way CI did, stop — the regression has spread to image push and you need a different host or account, not this runbook.
+3. **Promote `base-latest`** — `docker buildx imagetools create` re-tags by manifest reference. No rebuild.
+4. **Variants × 4** — sequential (not parallel; one host's egress can't saturate four multi-arch pushes safely). Each variant is `Dockerfile.variant` `FROM ${IMAGE}:base-${BASE_HASH}` plus the appropriate `INSTALL_OMOS` / `INSTALL_PI` build-args, tagged `${RELEASE_TAG}${suffix}` and `latest${suffix}`.
+5. **Verify** — prints the digest of all 10 expected tags (8 variant + base-hash + base-latest). Spot-check that each `vX.Y.Z*` and its `latest*` alias share a digest.
+
+Expected wall time on a recent Mac: ~25-40 min (base ~3 min if rebuilt, each variant ~3-7 min mostly QEMU arm64 emulation).
+
+## Optional: update DOCKER_HUB.md description
+
+CI's `update-description` job posts the rendered Hub description via the Hub API. The manual script does **not** do this — the release works fine without it. If you want parity, copy the curl invocation from the `update-description` job in `.gitea/workflows/docker-publish-split.yml` and run it from the host with a Hub PAT loaded into `HUB_PAT`. Cosmetic; can wait until CI is healthy and the next release pushes a fresh description automatically.
+
+## After: capture diagnostic value
+
+The whole point of running this manually is the diagnostic. Three things to record before moving on:
+
+1. **Did the host publish succeed?** If yes and CI was failing on the same exact code, you've localised the failure to the runner side (cache-export, network, runner image). If no, the failure is in your local buildx + Hub combination and CI is a victim, not a cause.
+2. **What was different from CI?** Document at minimum: `docker buildx version`, the host's `buildx ls` output (driver name + version), whether you used `--cache-to` or not, and which network you were on.
+3. **File the upstream issue.** If the diagnostic narrowed the failure to a specific buildkit/buildx behaviour, file it at `moby/buildkit` or `docker/buildx` with: stable failure shape, the exact request URL fragment (`Offset:0` / `_state=...` / digest if visible), the timeline boundary when failures started, and what worked vs what failed in your repro. The 2026-05-28 cache-export-mode=max regression is a worked example.
+
+Restore CI as the primary publish path as soon as the underlying regression is fixed or worked around at workflow level. This runbook should be exercised rarely.
+
+## Variants of this runbook
+
+- **pi-devbox** — same idea, simpler: only one image (`joakimp/pi-devbox`), one tag pair (`vX.Y.Z` + `latest`), no split base. Adapt the script: drop the `BASE_HASH` constant + steps 2-3 + the variant function; replace with a single `docker buildx build --file Dockerfile --build-arg PI_VERSION=... --tag joakimp/pi-devbox:${RELEASE_TAG} --tag joakimp/pi-devbox:latest --push .`.
+- **opencode-devbox letter-suffix rebuild** (e.g. `v1.15.12b`) — same procedure end-to-end. The `BASE_HASH` will probably be unchanged from the prior release if no rootfs/entrypoint/Dockerfile.base changes shipped, so the base-build step skips itself automatically via the Hub probe.
+- **Single-variant publish** for partial-failure recovery (e.g. CI succeeded for base + 3 variants but the 4th failed) — comment out the three completed `build_variant` calls in your copy of the script. Keep `imagetools create` for `base-latest` only if it didn't already promote. Then re-run.
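
The junk-file filters in method C of the runbook are easy to get wrong, so here is a minimal sketch that wraps the same recipe in a function and lets you confirm locally that stray `__pycache__`, `.DS_Store`, and `._*` files do not move the hash. `compute_base_hash` and its directory argument are hypothetical names introduced for illustration; the pipeline itself is copied from the runbook, and the sketch assumes `sha256sum`, `sort -z`, and `xargs -0` are available on PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the runbook's method C recipe.
# $1 is the repo root; the body runs in a subshell so the cd does not leak.
compute_base_hash() (
  cd "$1"
  {
    cat Dockerfile.base
    # Same junk filters as the runbook: bytecode caches and macOS metadata
    # are gitignored, so CI's clean checkout never hashes them.
    find rootfs -type f \
      ! -path '*/__pycache__/*' \
      ! -name '*.pyc' \
      ! -name '.DS_Store' \
      ! -name '._*' \
      -print0 2>/dev/null | sort -z | xargs -0 cat 2>/dev/null
    cat entrypoint.sh entrypoint-user.sh
  } | sha256sum | cut -c1-12
)
```

Run it as `compute_base_hash .` from the repo root. `sort -z` orders file names using the current locale, so if the result disagrees with CI's `base-decide` output, a locale difference is one thing worth ruling out alongside stray files; the runbook's advice to trust the CI value still applies.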
diff --git a/docs/manual-host-publish.sh b/docs/manual-host-publish.sh
new file mode 100755
index 0000000..0e563fb
--- /dev/null
+++ b/docs/manual-host-publish.sh
@@ -0,0 +1,117 @@
+#!/usr/bin/env bash
+# Manual publish of opencode-devbox v1.15.12 — bypasses broken Gitea-runner
+# Hub push by building & pushing from a developer host (Orbstack/Docker Desktop).
+#
+# Mirrors what .gitea/workflows/docker-publish-split.yml would do:
+#   1. Build & push Dockerfile.base → joakimp/opencode-devbox:base-
+#   2. Promote → joakimp/opencode-devbox:base-latest
+#   3. Build & push 4 variants on top of base-:
+#        :v1.15.12              :latest              (INSTALL_OPENCODE only)
+#        :v1.15.12-omos         :latest-omos         (+ OMOS)
+#        :v1.15.12-with-pi      :latest-with-pi      (+ pi)
+#        :v1.15.12-omos-with-pi :latest-omos-with-pi (+ both)
+#
+# Usage on your host:
+#   1. Make sure Orbstack/Docker Desktop is running with multi-arch enabled
+#      (docker buildx ls should show linux/amd64,linux/arm64).
+#   2. docker login docker.io (joakimp account)
+#   3. cd ~/path/to/opencode-devbox && git fetch && git checkout v1.15.12
+#   4. bash /path/to/this/script.sh
+#
+# Total expected time: ~25-40 min on a recent Mac (4 multi-arch builds, base
+# layers cache after the first variant).
+
+set -euo pipefail
+
+IMAGE="joakimp/opencode-devbox"
+RELEASE_TAG="v1.15.12"
+BASE_HASH="8d72a9e44796" # sha256 of Dockerfile.base + rootfs/* + entrypoints (computed by CI logic)
+BASE_TAG="base-${BASE_HASH}"
+PI_VERSION="0.76.0" # resolved from npm @earendil-works/pi-coding-agent latest (2026-05-28)
+OMOS_VERSION="1.1.1" # resolved from npm oh-my-opencode-slim latest (2026-05-28)
+PLATFORMS="linux/amd64,linux/arm64"
+
+# -------- preflight --------
+echo "==> Preflight"
+docker buildx version >/dev/null || { echo "buildx not available"; exit 1; }
+git rev-parse --verify "$RELEASE_TAG" >/dev/null 2>&1 || {
+  echo "Tag $RELEASE_TAG not found locally. git fetch && git checkout $RELEASE_TAG first."; exit 1; }
+[[ "$(git rev-parse HEAD)" == "$(git rev-parse "${RELEASE_TAG}^{commit}")" ]] || {
+  echo "HEAD is not at $RELEASE_TAG. git checkout $RELEASE_TAG first."; exit 1; }
+docker buildx inspect default >/dev/null 2>&1 || docker buildx create --use --name multi --driver docker-container
+
+# Probe whether base- already exists on Hub (CI does this; saves 10 min if yes)
+if docker manifest inspect "${IMAGE}:${BASE_TAG}" >/dev/null 2>&1; then
+  echo "==> Base tag ${IMAGE}:${BASE_TAG} already exists on Hub — skipping base rebuild"
+  SKIP_BASE=1
+else
+  echo "==> Base tag ${IMAGE}:${BASE_TAG} missing — will build"
+  SKIP_BASE=0
+fi
+
+# -------- 1. base (if needed) --------
+if [[ "$SKIP_BASE" == "0" ]]; then
+  echo "==> [1/6] Build & push Dockerfile.base → ${IMAGE}:${BASE_TAG}"
+  docker buildx build \
+    --platform "$PLATFORMS" \
+    -f Dockerfile.base \
+    -t "${IMAGE}:${BASE_TAG}" \
+    --push \
+    .
+fi
+
+# -------- 2. promote base-latest --------
+echo "==> [2/6] Promote ${IMAGE}:${BASE_TAG} → ${IMAGE}:base-latest"
+docker buildx imagetools create -t "${IMAGE}:base-latest" "${IMAGE}:${BASE_TAG}"
+
+# -------- 3-6. variants --------
+build_variant() {
+  local suffix="$1" # "" | "-omos" | "-with-pi" | "-omos-with-pi"
+  local install_omos="$2"
+  local install_pi="$3"
+  local extra_args=()
+  [[ "$install_pi" == "true" ]] && extra_args+=(--build-arg "PI_VERSION=${PI_VERSION}")
+  [[ "$install_omos" == "true" ]] && extra_args+=(--build-arg "OMOS_VERSION=${OMOS_VERSION}")
+
+  local versioned="${IMAGE}:${RELEASE_TAG}${suffix}"
+  local floating="${IMAGE}:latest${suffix}"
+
+  echo "==> Build & push variant${suffix:-(default)} → ${versioned} + ${floating}"
+  docker buildx build \
+    --platform "$PLATFORMS" \
+    -f Dockerfile.variant \
+    --build-arg "BASE_IMAGE=${IMAGE}:${BASE_TAG}" \
+    --build-arg "INSTALL_OPENCODE=true" \
+    --build-arg "INSTALL_OMOS=${install_omos}" \
+    --build-arg "INSTALL_PI=${install_pi}" \
+    ${extra_args[@]+"${extra_args[@]}"} \
+    -t "${versioned}" \
+    -t "${floating}" \
+    --push \
+    .
+}
+
+echo "==> [3/6] Variant: base (opencode only)"
+build_variant "" false false
+
+echo "==> [4/6] Variant: omos"
+build_variant "-omos" true false
+
+echo "==> [5/6] Variant: with-pi"
+build_variant "-with-pi" false true
+
+echo "==> [6/6] Variant: omos-with-pi"
+build_variant "-omos-with-pi" true true
+
+echo
+echo "==> Done. Verifying tags on Hub:"
+for t in \
+  "${RELEASE_TAG}" "latest" \
+  "${RELEASE_TAG}-omos" "latest-omos" \
+  "${RELEASE_TAG}-with-pi" "latest-with-pi" \
+  "${RELEASE_TAG}-omos-with-pi" "latest-omos-with-pi" \
+  "${BASE_TAG}" "base-latest"
+do # imagetools reports the tag's own digest; plain manifest JSON has no top-level "digest" key
+  d=$(docker buildx imagetools inspect --format '{{json .Manifest}}' "${IMAGE}:${t}" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin).get('digest','-'))" 2>/dev/null || echo "MISSING")
+  printf " %-32s %s\n" "$t" "$d"
+done
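
The runbook asks for a spot-check that each versioned tag and its `latest*` alias share a digest. A minimal sketch of that comparison follows. `digest_of` and `check_alias_pairs` are hypothetical helpers, not part of the committed script; `digest_of` assumes `docker buildx imagetools inspect --format '{{json .Manifest}}'` prints a JSON descriptor with a top-level `digest` field, and the optional third argument exists only so the lookup can be stubbed on a machine without Docker.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: resolve an image reference to its manifest digest.
# Assumes the imagetools JSON descriptor carries a top-level "digest" key.
digest_of() {
  docker buildx imagetools inspect --format '{{json .Manifest}}' "$1" 2>/dev/null \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["digest"])'
}

# Compare each versioned tag with its floating alias.
# Usage: check_alias_pairs IMAGE RELEASE_TAG [LOOKUP_FUNCTION]
check_alias_pairs() {
  local image="$1" release="$2" lookup="${3:-digest_of}" rc=0 suffix v f
  for suffix in "" -omos -with-pi -omos-with-pi; do
    v=$("$lookup" "${image}:${release}${suffix}") || v="MISSING"
    f=$("$lookup" "${image}:latest${suffix}") || f="MISSING"
    if [[ "$v" == "MISSING" || "$v" != "$f" ]]; then
      echo "MISMATCH ${release}${suffix}: ${v} vs ${f}"
      rc=1
    else
      echo "OK ${release}${suffix} ${v}"
    fi
  done
  return "$rc"
}
```

Invoked as `check_alias_pairs joakimp/opencode-devbox v1.15.12`, it returns non-zero if any pair is missing or diverges, which makes it usable as a final gate rather than a table to eyeball.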