fix: mempalace-mcp uninterruptible hang resolved via toolkit ext timeout
The per-request timeout + stall-kill landed in mempalace-toolkit's mempalace.ts pi extension (commit a3b8829), which the base clones at build via MEMPALACE_TOOLKIT_REF=main. A base rebuild picks it up.

- CHANGELOG: move from 'Known issues' to 'Fixed'; document the env knobs (MEMPALACE_MCP_TIMEOUT_MS / MEMPALACE_MCP_INIT_TIMEOUT_MS) and why the standalone stdio-watchdog shim was dropped.
- Dockerfile.base: replace the TODO with a note pointing at the fix.
30 additions, 19 deletions
@@ -13,28 +13,39 @@ Pre-v1.0.0 tags followed the pi npm version (`v{pi_version}[letter]`).
 
 ## Unreleased
 
-### Known issues
+### Fixed
 
-- **`mempalace-mcp` can hang the pi TUI uninterruptibly** when the
-palace is bind-mounted from the macOS host (OrbStack virtiofs) and the
-container opens a large `chroma.sqlite3` for the first time. Symptoms:
-pi sits silently after a tool call, ESC does not abort, no progress
-output. Root cause is **not** WAL contention with another writer (we
-initially suspected this and ruled it out — diagnosis 2026-06-13 with
-no other mempalace process running). Most likely causes, in order:
-1. SQLite cold-open `fcntl`/`flock` semantics over OrbStack virtiofs
-stalling the chromadb open path before mempalace-mcp emits its
-`initialize` JSON-RPC response — pi blocks on the handshake.
-2. Cold HNSW index load/rebuild for a large wing (~23k drawers) doing
-random-access I/O over virtiofs.
-3. Stale WAL recovery from a previously OOM-killed mempalace-mcp.
+- **`mempalace-mcp` no longer hangs the pi TUI uninterruptibly.** When
+the palace is bind-mounted from the macOS host (OrbStack virtiofs) and
+the container opened a large `chroma.sqlite3` for the first time, a
+cold storage open / HNSW load could stall the server before it emitted
+its JSON-RPC response. The awaiting promise then hung forever and the
+TUI froze — ESC cancels the LLM stream, not a pending MCP tool call, so
+there was no way out short of `docker exec <container> pkill -9 -f
+mempalace-mcp` and restarting pi.
 
-ESC not interrupting is a pi-side limitation: pi cancels the LLM stream
-but keeps awaiting the MCP child's stdio. There is no per-call MCP
-timeout in pi's config. Workaround when stuck:
-`docker exec <container> pkill -9 -f mempalace-mcp` then restart pi.
+The fix lives in the `mempalace.ts` pi extension shipped by
+**mempalace-toolkit** (cloned into the base at build time via
+`MEMPALACE_TOOLKIT_REF`, default `main`): the JSON-RPC client now arms
+a **per-request** timeout. On expiry it rejects the request *and* kills
+the stalled child (SIGTERM→SIGKILL), so pi surfaces an error instead of
+hanging; the bridge then marks itself unavailable so subsequent calls
+fail fast (restart pi to retry). This is deliberately per-REQUEST, not
+a process-lifetime `timeout 60 mempalace-mcp` wrapper — the long-lived
+server is only killed when a request genuinely stalls.
 
-Planned fix: a thin Python stdio-watchdog shim in front of
+Tunables (env): `MEMPALACE_MCP_TIMEOUT_MS` (tool-call timeout, default
+`60000`), `MEMPALACE_MCP_INIT_TIMEOUT_MS` (initialize/tools-list
+handshake, default `120000`); set either to `0` to disable. Requires a
+base rebuild to pull the updated extension. The earlier plan of a
+standalone Python stdio-watchdog shim was dropped: the extension
+already owns request/response correlation, so a separate
+framing-reparsing shim is unnecessary.
+
+Still open (out of scope here): sharing one palace across harnesses
+ideally wants a single host-side `mempalace-mcp` daemon multiplexing
+stdio over a UNIX socket, so all clients share one writer on native
+APFS rather than each cold-opening over virtiofs.
 `mempalace-mcp` that applies a per-request timeout and kills the child
 on stall, **without** killing the long-lived server itself (a naive
 `timeout 60 mempalace-mcp` wrapper is wrong — it kills the server
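The per-request timeout described in the changelog entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from `mempalace.ts`: `RpcBridge`, `Schedule`, the injected `send` and `kill` callbacks, and the 2000 ms grace period between SIGTERM and SIGKILL are all hypothetical names and values chosen for the example. Failing every in-flight request once the child is being killed is also an assumption.

```typescript
// Hypothetical sketch; names and structure are illustrative, not the real mempalace.ts.
type Signal = "SIGTERM" | "SIGKILL";
// Schedules fn after ms and returns a cancel function.
type Schedule = (fn: () => void, ms: number) => () => void;

interface Pending {
  resolve: (value: unknown) => void;
  reject: (err: Error) => void;
  cancelTimer: () => void;
}

// Default scheduler backed by setTimeout; a fake one can be injected in tests.
const realSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

class RpcBridge {
  unavailable = false;
  private nextId = 1;
  private pending = new Map<number, Pending>();
  private send: (line: string) => void;   // writes one JSON-RPC message to the child's stdin
  private kill: (signal: Signal) => void; // signals the child process
  private schedule: Schedule;
  private killGraceMs: number;            // assumed grace period before SIGKILL

  constructor(
    send: (line: string) => void,
    kill: (signal: Signal) => void,
    schedule: Schedule = realSchedule,
    killGraceMs: number = 2000,
  ) {
    this.send = send;
    this.kill = kill;
    this.schedule = schedule;
    this.killGraceMs = killGraceMs;
  }

  request(method: string, params: unknown, timeoutMs: number): Promise<unknown> {
    if (this.unavailable) {
      // Fail fast once a stall has been detected.
      return Promise.reject(new Error("bridge unavailable; restart to retry"));
    }
    const id = this.nextId++;
    return new Promise<unknown>((resolve, reject) => {
      // A timeout of 0 means no timer is armed for this request.
      const cancelTimer = timeoutMs > 0
        ? this.schedule(() => this.onStall(method, timeoutMs), timeoutMs)
        : () => {};
      this.pending.set(id, { resolve, reject, cancelTimer });
      this.send(JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n");
    });
  }

  // Called by the stdout reader once a response has been parsed.
  handleResponse(id: number, result: unknown): void {
    const entry = this.pending.get(id);
    if (!entry) return; // late or unknown response
    this.pending.delete(id);
    entry.cancelTimer();
    entry.resolve(result);
  }

  private onStall(method: string, timeoutMs: number): void {
    if (this.unavailable) return; // another request already triggered the kill
    this.unavailable = true;
    const err = new Error(`${method} timed out after ${timeoutMs} ms`);
    // Assumption: the child is going away, so every in-flight request is rejected.
    this.pending.forEach((entry) => {
      entry.cancelTimer();
      entry.reject(err);
    });
    this.pending.clear();
    this.kill("SIGTERM");
    // Simplification: SIGKILL is sent after the grace period without checking for exit.
    this.schedule(() => this.kill("SIGKILL"), this.killGraceMs);
  }
}
```

The timer belongs to each request rather than to the process, so a healthy long-lived server is never killed just for being old. In a Node-based extension, `kill` would typically wrap the child process handle's `kill(signal)` method.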
9 additions, 8 deletions
@@ -280,14 +280,15 @@ RUN ARCH=$(case "${TARGETARCH}" in amd64) echo "x86_64" ;; arm64) echo "aarch64"
 # Always installed in the base. Set INSTALL_MEMPALACE=false at base-build
 # time to shave ~300 MB.
 #
-# TODO(2026-06-13): wrap mempalace-mcp with a stdio-watchdog shim that
-# applies a per-REQUEST timeout (not a per-process timeout — naive
-# `timeout 60 mempalace-mcp` would kill the long-lived server mid-session).
-# When the palace is bind-mounted from macOS via OrbStack virtiofs, cold
-# chroma.sqlite3 open or HNSW load can stall the JSON-RPC `initialize`
-# response and pi's TUI sits uninterruptibly (ESC cancels the LLM stream,
-# not the MCP child stdio). See CHANGELOG.md "Unreleased > Known issues".
-# Recovery today: `docker exec <ctr> pkill -9 -f mempalace-mcp`.
+# Stall protection (fixed 2026-06-13): mempalace-mcp is launched by the
+# `mempalace.ts` pi extension from mempalace-toolkit (cloned below). That
+# extension now applies a per-REQUEST timeout in its JSON-RPC client and
+# kills the child on stall, so a virtiofs cold-open of chroma.sqlite3 /
+# HNSW load can no longer hang the pi TUI uninterruptibly. Tunables:
+# MEMPALACE_MCP_TIMEOUT_MS (default 60000), MEMPALACE_MCP_INIT_TIMEOUT_MS
+# (default 120000); 0 disables. A standalone stdio-watchdog shim is NOT
+# needed — the extension already owns request/response correlation. See
+# CHANGELOG.md "Unreleased > Fixed".
 ARG INSTALL_MEMPALACE=true
 # Pin to a known-good version. Bump deliberately, not implicitly: an
 # unpinned install silently swept in mempalace 3.3.x/3.4.0 with a broken
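The two tunables named in the note above could be resolved along these lines. Only the variable names, the defaults (60000 and 120000) and the "0 disables" rule come from the diff. The helper names `readTimeoutMs` and `resolveMcpTimeouts`, and the choice to fall back to the default on unset, empty, non-numeric or negative values, are assumptions for illustration and may not match what the extension actually does.

```typescript
// Hypothetical helper; not taken from mempalace.ts.
type Env = Record<string, string | undefined>;

interface McpTimeouts {
  callMs: number; // tool-call timeout; 0 means no timer is armed
  initMs: number; // initialize / tools-list handshake timeout; 0 disables
}

// Assumption: unset, empty, non-numeric or negative values fall back to the default.
function readTimeoutMs(env: Env, name: string, fallbackMs: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed < 0) return fallbackMs;
  return Math.floor(parsed);
}

function resolveMcpTimeouts(env: Env): McpTimeouts {
  return {
    callMs: readTimeoutMs(env, "MEMPALACE_MCP_TIMEOUT_MS", 60000),
    initMs: readTimeoutMs(env, "MEMPALACE_MCP_INIT_TIMEOUT_MS", 120000),
  };
}
```

In a Node process the caller would pass `process.env`. Keeping the handshake value separate from the tool-call value matches the larger default the diff gives the `initialize`/tools-list step, which is where the cold open is described as stalling.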