From 05e88c5c75833fd8c15d82b89fe6b2933915663f Mon Sep 17 00:00:00 2001
From: pi
Date: Sat, 13 Jun 2026 23:49:36 +0200
Subject: [PATCH] fix: mempalace-mcp uninterruptible hang resolved via toolkit ext timeout

The per-request timeout + stall-kill landed in mempalace-toolkit's
mempalace.ts pi extension (commit a3b8829), which the base clones at
build via MEMPALACE_TOOLKIT_REF=main. A base rebuild picks it up.

- CHANGELOG: move from 'Known issues' to 'Fixed'; document the env knobs
  (MEMPALACE_MCP_TIMEOUT_MS / MEMPALACE_MCP_INIT_TIMEOUT_MS) and why the
  standalone stdio-watchdog shim was dropped.
- Dockerfile.base: replace the TODO with a note pointing at the fix.
---
 CHANGELOG.md    | 49 ++++++++++++++++++++++++++++++-------------------
 Dockerfile.base | 17 +++++++++--------
 2 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7746990..19f92e7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,28 +13,39 @@ Pre-v1.0.0 tags followed the pi npm version (`v{pi_version}[letter]`).
 
 ## Unreleased
 
-### Known issues
+### Fixed
 
-- **`mempalace-mcp` can hang the pi TUI uninterruptibly** when the
-  palace is bind-mounted from the macOS host (OrbStack virtiofs) and the
-  container opens a large `chroma.sqlite3` for the first time. Symptoms:
-  pi sits silently after a tool call, ESC does not abort, no progress
-  output. Root cause is **not** WAL contention with another writer (we
-  initially suspected this and ruled it out — diagnosis 2026-06-13 with
-  no other mempalace process running). Most likely causes, in order:
-  1. SQLite cold-open `fcntl`/`flock` semantics over OrbStack virtiofs
-     stalling the chromadb open path before mempalace-mcp emits its
-     `initialize` JSON-RPC response — pi blocks on the handshake.
-  2. Cold HNSW index load/rebuild for a large wing (~23k drawers) doing
-     random-access I/O over virtiofs.
-  3. Stale WAL recovery from a previously OOM-killed mempalace-mcp.
+- **`mempalace-mcp` no longer hangs the pi TUI uninterruptibly.** When
+  the palace is bind-mounted from the macOS host (OrbStack virtiofs) and
+  the container opened a large `chroma.sqlite3` for the first time, a
+  cold storage open / HNSW load could stall the server before it emitted
+  its JSON-RPC response. The awaiting promise then hung forever and the
+  TUI froze — ESC cancels the LLM stream, not a pending MCP tool call, so
+  there was no way out short of `docker exec pkill -9 -f
+  mempalace-mcp` and restarting pi.
 
-  ESC not interrupting is a pi-side limitation: pi cancels the LLM stream
-  but keeps awaiting the MCP child's stdio. There is no per-call MCP
-  timeout in pi's config. Workaround when stuck:
-  `docker exec pkill -9 -f mempalace-mcp` then restart pi.
+  The fix lives in the `mempalace.ts` pi extension shipped by
+  **mempalace-toolkit** (cloned into the base at build time via
+  `MEMPALACE_TOOLKIT_REF`, default `main`): the JSON-RPC client now arms
+  a **per-request** timeout. On expiry it rejects the request *and* kills
+  the stalled child (SIGTERM→SIGKILL), so pi surfaces an error instead of
+  hanging; the bridge then marks itself unavailable so subsequent calls
+  fail fast (restart pi to retry). This is deliberately per-REQUEST, not
+  a process-lifetime `timeout 60 mempalace-mcp` wrapper — the long-lived
+  server is only killed when a request genuinely stalls.
 
-  Planned fix: a thin Python stdio-watchdog shim in front of
+  Tunables (env): `MEMPALACE_MCP_TIMEOUT_MS` (tool-call timeout, default
+  `60000`), `MEMPALACE_MCP_INIT_TIMEOUT_MS` (initialize/tools-list
+  handshake, default `120000`); set either to `0` to disable. Requires a
+  base rebuild to pull the updated extension. The earlier plan of a
+  standalone Python stdio-watchdog shim was dropped: the extension
+  already owns request/response correlation, so a separate
+  framing-reparsing shim is unnecessary.
+
+  Still open (out of scope here): sharing one palace across harnesses
+  ideally wants a single host-side `mempalace-mcp` daemon multiplexing
+  stdio over a UNIX socket, so all clients share one writer on native
+  APFS rather than each cold-opening over virtiofs.
   `mempalace-mcp` that applies a per-request timeout and kills the child
   on stall, **without** killing the long-lived server itself (a naive
   `timeout 60 mempalace-mcp` wrapper is wrong — it kills the server
diff --git a/Dockerfile.base b/Dockerfile.base
index 9f3ae0d..c44ae25 100644
--- a/Dockerfile.base
+++ b/Dockerfile.base
@@ -280,14 +280,15 @@ RUN ARCH=$(case "${TARGETARCH}" in amd64) echo "x86_64" ;; arm64) echo "aarch64"
 # Always installed in the base. Set INSTALL_MEMPALACE=false at base-build
 # time to shave ~300 MB.
 #
-# TODO(2026-06-13): wrap mempalace-mcp with a stdio-watchdog shim that
-# applies a per-REQUEST timeout (not a per-process timeout — naive
-# `timeout 60 mempalace-mcp` would kill the long-lived server mid-session).
-# When the palace is bind-mounted from macOS via OrbStack virtiofs, cold
-# chroma.sqlite3 open or HNSW load can stall the JSON-RPC `initialize`
-# response and pi's TUI sits uninterruptibly (ESC cancels the LLM stream,
-# not the MCP child stdio). See CHANGELOG.md "Unreleased > Known issues".
-# Recovery today: `docker exec pkill -9 -f mempalace-mcp`.
+# Stall protection (fixed 2026-06-13): mempalace-mcp is launched by the
+# `mempalace.ts` pi extension from mempalace-toolkit (cloned below). That
+# extension now applies a per-REQUEST timeout in its JSON-RPC client and
+# kills the child on stall, so a virtiofs cold-open of chroma.sqlite3 /
+# HNSW load can no longer hang the pi TUI uninterruptibly. Tunables:
+# MEMPALACE_MCP_TIMEOUT_MS (default 60000), MEMPALACE_MCP_INIT_TIMEOUT_MS
+# (default 120000); 0 disables. A standalone stdio-watchdog shim is NOT
+# needed — the extension already owns request/response correlation. See
+# CHANGELOG.md "Unreleased > Fixed".
 ARG INSTALL_MEMPALACE=true
 # Pin to a known-good version. Bump deliberately, not implicitly: an
 # unpinned install silently swept in mempalace 3.3.x/3.4.0 with a broken