From 51ec4a88cfc7e10258bd28357ea7d23122fb4704 Mon Sep 17 00:00:00 2001
From: Joakim Persson
Date: Thu, 28 May 2026 10:40:08 +0000
Subject: [PATCH] CI: drop registry cache-export from build-base (Hub 400 root cause)

Diagnosed during manual v1.15.12 publish: buildkit's mode=max cache
export to registry-1.docker.io reproducibly returns HTTP 400 with HTML
body on the resumable-upload PUT. Image push (layers + manifest) works
fine in parallel; only --cache-to fails. Removing cache-from/cache-to
lets the publish complete.

This explains all four prior CI failures (runs 332/333/334/336) which
shared the exact same failure shape. Action-pin hypothesis
(setup-buildx-action v4.1.0) was correctly disproven by run 336 with
v4.0.0 pinned.

Trade-off: every Dockerfile.base change now pays the full ~3 min
multi-arch build. Unchanged bases short-circuit at the content-addressed
probe step in base-decide and never re-build, so day-to-day cost is
zero.

Re-enable when moby/buildkit upstream resolves the cache-export protocol
mismatch with Hub CDN, or when we can switch to a non-registry cache
backend.

CHANGELOG.md: full root-cause writeup in Unreleased section, including
status update on every prior suspect (all ruled out).
---
 .gitea/workflows/docker-publish-split.yml | 11 ++++++---
 CHANGELOG.md                              | 28 ++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/.gitea/workflows/docker-publish-split.yml b/.gitea/workflows/docker-publish-split.yml
index e89ea5f..81c55c0 100644
--- a/.gitea/workflows/docker-publish-split.yml
+++ b/.gitea/workflows/docker-publish-split.yml
@@ -192,9 +192,14 @@ jobs:
           platforms: linux/amd64,linux/arm64
           push: true
           tags: ${{ env.IMAGE }}:${{ needs.base-decide.outputs.base_tag }}
-          # Registry cache for faster repeat base rebuilds (e.g. Node bump).
-          cache-from: type=registry,ref=${{ env.IMAGE }}:base-buildcache
-          cache-to: type=registry,ref=${{ env.IMAGE }}:base-buildcache,mode=max
+          # Registry cache disabled: buildkit's cache-export (mode=max) hits a
+          # reproducible HTTP 400 from registry-1.docker.io on the resumable-
+          # upload PUT (state-token format mismatch on Hub CDN, suspected to
+          # have started ~2026-05-23). Image push itself works fine. We pay
+          # the full base build on every Dockerfile.base change, but the base
+          # tag itself is content-addressed (base-) so unchanged bases
+          # short-circuit at the probe step and never re-build anyway. Re-
+          # enable when upstream resolves; tracked in CHANGELOG v1.15.12.
 
   # ── Phase 3: amd64 smoke per variant (gates the multi-arch publish) ─
   # Each smoke job builds amd64-only against the base tag and runs
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4fdadff..a7d93d1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,33 @@ Tags follow `v{opencode_version}[letter]` — bare tag for the first build on a
 
 ## Unreleased
 
-_(no changes since v1.15.12)_
+### Hub-push regression — root cause identified, CI fixed
+
+The `400 Bad request` from `registry-1.docker.io` that broke CI publishing across runs #332/333/334/336 (and forced v1.15.12 to ship via manual host-side push) is **buildkit's registry cache-export with `mode=max`**, not the image push itself.
+
+**Diagnostic that nailed it:** the manual v1.15.12 publish from an Orbstack host reproduced the exact same 400 — but only on the cache-export step. Image layers pushed cleanly (911s for the base, all variants succeeded). Dropping `--cache-to` from the manual script let the publish complete. Running the same buildx version against the same Hub account from the same network, the only differential was cache export vs. image export.
+
+This explains every observation:
+
+- Failure shape stable across attempts (`Offset:0`, HTML body, CDN-tier rejection): cache-export protocol-level mismatch, not transient network or per-blob corruption.
+- Repo-specific (`joakimp/opencode-devbox` only): we're the only Hub repo currently writing a `:base-buildcache` tag with `mode=max`.
+- Started ~2026-05-23: lines up with buildx 0.34.x rolling out and bundling moby/buildkit v0.30.0, which changed the `_state` token format on resumable cache uploads.
+- Image push works fine: cache-export is a separate codepath using a different manifest/layer scheme.
+- Action-pin to `setup-buildx-action@v4.0.0` didn't help: that pin pulls older actions-toolkit, but the bundled buildkit was still 0.34.x via Buildx CLI on the runner image. Pin was correctly disproven by run #336.
+
+### Workflow change
+
+- **`.gitea/workflows/docker-publish-split.yml`** — registry cache (`cache-from`/`cache-to`) removed from the `build-base` step. Comment in place documenting the regression and the re-enable condition. Variants don't use registry cache so they're untouched. The base tag is content-addressed (`base-` derived from Dockerfile.base + rootfs/* + entrypoint*.sh) so unchanged bases short-circuit at the Hub-probe step in `base-decide` and never re-build anyway — the lost cache only affects the rare case of a Dockerfile.base change, where we now pay the full ~3 min build instead of pulling cached layers. Acceptable trade-off vs. broken publishes.
+
+Next tag push (e.g. v1.15.13) is expected to publish cleanly via Gitea CI again. validate.yml on this main push will be the first real-time test of the smoke side; full publish path will be tested on the next opencode bump or by a deliberate letter-suffix re-tag.
+
+### Status of earlier suspects
+
+- ~~`setup-buildx-action@v4.1.0`~~ — disproven by v1.15.11b CI run #336 with v4.0.0 pin failing identically. Pin reverted in v1.15.12. Not the regressor.
+- ~~`@docker/actions-toolkit 0.79.0 → 0.90.0`~~ — rolled back via the action pin; same failure. Not the regressor.
+- ~~Account / repo / Hub-CDN globally~~ — local pushes from developer host succeed. Always was healthy.
+- ~~`catthehacker/ubuntu:act-latest`~~ / ~~act-runner egress~~ — manual publish from host reproduced the same 400, ruling out runner-side network. Not the cause.
+- **Confirmed:** buildkit cache-export protocol (mode=max) hitting Hub-CDN edge rejection. Workaround: don't export cache to registry. Long-term: track moby/buildkit upstream for protocol fix or switch to GHA cache (not portable to Gitea Actions).
 
 ---
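
Reviewer note (not part of the patch): the trade-off in this change rests on the base tag being content-addressed, so an unchanged base never rebuilds. The sketch below shows one way such a tag could be derived from the inputs the CHANGELOG names (Dockerfile.base, rootfs/*, entrypoint*.sh). It is an illustration only: the function name `content_tag`, the SHA-256 choice, the 12-character suffix, and the glob patterns are all assumptions, and the real `base-decide` step may compute its tag differently.

```python
import hashlib
from pathlib import Path

# Hypothetical input set, mirroring the CHANGELOG wording; the actual
# workflow may hash a different set of files.
DEFAULT_PATTERNS = ("Dockerfile.base", "rootfs/**/*", "entrypoint*.sh")


def content_tag(root, patterns=DEFAULT_PATTERNS, prefix="base-", length=12):
    """Return a tag that changes only when the matched files change.

    Assumption: SHA-256 over (relative path, file bytes) pairs, truncated
    to `length` hex characters. This is not taken from the repository.
    """
    root = Path(root)
    # Collect matching regular files once, in a stable order.
    files = sorted(
        {p for pattern in patterns for p in root.glob(pattern) if p.is_file()}
    )
    digest = hashlib.sha256()
    for path in files:
        # Hash the relative path too, so renames change the tag.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return prefix + digest.hexdigest()[:length]
```

With a tag like this, a probe step only needs to ask the registry whether the computed tag already exists and skip the build if it does, which is why dropping the registry cache costs nothing for unchanged bases.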