docs: capture mempalace-mcp uninterruptible-hang diagnosis (2026-06-13)
Symptom: pi TUI blocks on a mempalace tool call, ESC does not abort. Initial WAL-contention hypothesis ruled out (no other writer running).

Likely cause: virtiofs cold open of chroma.sqlite3 stalls the JSON-RPC initialize handshake; pi has no per-call MCP timeout.

Recovery today: docker exec <ctr> pkill -9 -f mempalace-mcp, restart pi.

Planned fix (deferred until after opencode-devbox pi removal): stdio watchdog shim with per-REQUEST timeout. A naive process-lifetime timeout wrapper is wrong because mempalace-mcp is long-lived. Sharing the palace across harnesses remains the goal.
@@ -13,6 +13,37 @@ Pre-v1.0.0 tags followed the pi npm version (`v{pi_version}[letter]`).

## Unreleased

### Known issues

- **`mempalace-mcp` can hang the pi TUI uninterruptibly** when the
palace is bind-mounted from the macOS host (OrbStack virtiofs) and the
container opens a large `chroma.sqlite3` for the first time. Symptoms:
pi sits silently after a tool call, ESC does not abort, and no progress
output appears. The root cause is **not** WAL contention with another
writer: we initially suspected this and ruled it out during the
2026-06-13 diagnosis, when no other mempalace process was running.
Most likely causes, most likely first:
1. SQLite cold-open `fcntl`/`flock` semantics over OrbStack virtiofs
stalling the chromadb open path before mempalace-mcp emits its
`initialize` JSON-RPC response, so pi blocks on the handshake.
2. Cold HNSW index load/rebuild for a large wing (~23k drawers) doing
random-access I/O over virtiofs.
3. Stale WAL recovery from a previously OOM-killed mempalace-mcp.

ESC failing to interrupt is a pi-side limitation: pi cancels the LLM
stream but keeps awaiting the MCP child's stdio. There is no per-call
MCP timeout in pi's config. Workaround when stuck: run
`docker exec <container> pkill -9 -f mempalace-mcp`, then restart pi.

Planned fix: a thin Python stdio-watchdog shim in front of
`mempalace-mcp` that applies a per-request timeout and kills the child
on stall, **without** killing the long-lived server while it is healthy
(a naive `timeout 60 mempalace-mcp` wrapper is wrong: it kills the
server after 60 seconds, mid-session). Sharing the palace across
harnesses (native pi, container pi, opencode) remains the goal;
isolated palaces defeat the point.
Longer term: run a single mempalace-mcp daemon on the host and
multiplex stdio over a UNIX socket so all clients share one writer on
native APFS.
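
The per-request bookkeeping behind the planned shim could look roughly like the sketch below. It is illustrative only and is not the shim itself: `RequestWatchdog`, its method names, and the injectable `clock` are hypothetical, and it assumes one newline-delimited JSON-RPC message per stdio line. Process spawning, stdio forwarding, and the action taken once a request expires are deliberately left out.

```python
import json
import time


class RequestWatchdog:
    """Track in-flight JSON-RPC requests and report those past their deadline.

    Hypothetical helper for illustration; not part of mempalace or pi.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.pending = {}   # request id -> monotonic deadline

    @staticmethod
    def _parse(line):
        """Return the decoded message if it is a JSON object, else None."""
        try:
            msg = json.loads(line)
        except ValueError:
            return None
        return msg if isinstance(msg, dict) else None

    def on_client_line(self, line):
        """Record a deadline for each request the client sends to the server."""
        msg = self._parse(line)
        # Notifications have no "id" and never get a response, so skip them.
        if msg is not None and "method" in msg and "id" in msg:
            self.pending[msg["id"]] = self.clock() + self.timeout_s

    def on_server_line(self, line):
        """Clear the deadline once the server answers a request."""
        msg = self._parse(line)
        # A response carries "id" plus "result" or "error", and no "method".
        if msg is not None and "method" not in msg and "id" in msg:
            self.pending.pop(msg["id"], None)

    def expired(self):
        """Return ids of requests still unanswered past their deadline."""
        now = self.clock()
        return [rid for rid, deadline in self.pending.items() if now >= deadline]
```

A shim built on this would call `on_client_line` and `on_server_line` as it forwards each line and poll `expired()` periodically. Because deadlines attach to individual requests, an idle but healthy server is never touched, which is the difference from a process-lifetime `timeout` wrapper.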

### Added

- **`dot-watch` helper** (`/usr/local/bin/dot-watch`) — auto-rerenders a
@@ -279,6 +279,15 @@ RUN ARCH=$(case "${TARGETARCH}" in amd64) echo "x86_64" ;; arm64) echo "aarch64"
# Provides semantic search over conversation history via 29 MCP tools.
# Always installed in the base. Set INSTALL_MEMPALACE=false at base-build
# time to shave ~300 MB.
#
# TODO(2026-06-13): wrap mempalace-mcp with a stdio-watchdog shim that
# applies a per-REQUEST timeout (not a per-process timeout: a naive
# `timeout 60 mempalace-mcp` would kill the long-lived server mid-session).
# When the palace is bind-mounted from macOS via OrbStack virtiofs, a cold
# chroma.sqlite3 open or HNSW load can stall the JSON-RPC `initialize`
# response, and pi's TUI hangs uninterruptibly (ESC cancels the LLM stream,
# not the MCP child stdio). See CHANGELOG.md "Unreleased > Known issues".
# Recovery today: `docker exec <ctr> pkill -9 -f mempalace-mcp`.
ARG INSTALL_MEMPALACE=true
# Pin to a known-good version. Bump deliberately, not implicitly: an
# unpinned install silently swept in mempalace 3.3.x/3.4.0 with a broken