docs: capture mempalace-mcp uninterruptible-hang diagnosis (2026-06-13)

Symptom: pi TUI blocks on a mempalace tool call, ESC does not abort.
Initial WAL-contention hypothesis ruled out (no other writer running).
Likely cause: virtiofs cold open of chroma.sqlite3 stalls the JSON-RPC
initialize handshake; pi has no per-call MCP timeout.

Recovery today: docker exec <ctr> pkill -9 -f mempalace-mcp, restart pi.

Planned fix (deferred until after opencode-devbox pi removal): stdio
watchdog shim with per-REQUEST timeout. A naive process-lifetime
timeout wrapper is wrong because mempalace-mcp is long-lived.

Sharing the palace across harnesses remains the goal.
pi
2026-06-13 16:18:45 +02:00
parent ab5ff8ec56
commit 7f67c36a1c
2 changed files with 40 additions and 0 deletions
@@ -13,6 +13,37 @@ Pre-v1.0.0 tags followed the pi npm version (`v{pi_version}[letter]`).
## Unreleased
### Known issues
- **`mempalace-mcp` can hang the pi TUI uninterruptibly** when the
  palace is bind-mounted from the macOS host (OrbStack virtiofs) and the
  container opens a large `chroma.sqlite3` for the first time. Symptoms:
  pi sits silently after a tool call, ESC does not abort, and there is no
  progress output. The root cause is **not** WAL contention with another
  writer: we initially suspected this and ruled it out (diagnosis
  2026-06-13, with no other mempalace process running). Most likely
  causes, in order:
  1. SQLite cold-open `fcntl`/`flock` semantics over OrbStack virtiofs
     stalling the chromadb open path before mempalace-mcp emits its
     `initialize` JSON-RPC response, so pi blocks on the handshake.
  2. Cold HNSW index load/rebuild for a large wing (~23k drawers) doing
     random-access I/O over virtiofs.
  3. Stale WAL recovery from a previously OOM-killed mempalace-mcp.

  ESC not interrupting is a pi-side limitation: pi cancels the LLM stream
  but keeps awaiting the MCP child's stdio. There is no per-call MCP
  timeout in pi's config. Workaround when stuck:
  `docker exec <container> pkill -9 -f mempalace-mcp`, then restart pi.

  Planned fix: a thin Python stdio-watchdog shim in front of
  `mempalace-mcp` that applies a per-request timeout and kills the child
  only when a request stalls, instead of bounding the lifetime of the
  long-lived server (a naive `timeout 60 mempalace-mcp` wrapper is wrong
  because it kills the server mid-session). Sharing the palace across
  harnesses (native pi, container pi, opencode) remains the goal, since
  isolated palaces defeat the point.

  Longer term: run a single mempalace-mcp daemon on the host and
  multiplex stdio over a UNIX socket so all clients share one writer on
  native APFS.
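
The per-request timeout in the planned fix could be tracked roughly as in the sketch below. It is not the shim itself: it only shows the bookkeeping that distinguishes a per-request deadline from a process-lifetime one, and it assumes newline-delimited JSON-RPC 2.0 messages on stdio. The names `RequestWatchdog`, `REQUEST_TIMEOUT_S` and `timeout_error`, the 60-second default, and the `-32000` error code are all illustrative assumptions, not part of mempalace-mcp or pi.

```python
import json
import time
from typing import Dict, List, Optional

REQUEST_TIMEOUT_S = 60.0  # assumed default; the changelog does not fix a value


def _parse(line: str) -> Optional[dict]:
    """Return the decoded JSON object, or None for anything that is not one."""
    try:
        msg = json.loads(line)
    except ValueError:
        return None
    return msg if isinstance(msg, dict) else None


class RequestWatchdog:
    """Tracks in-flight JSON-RPC requests and reports those past their deadline."""

    def __init__(self, timeout_s: float = REQUEST_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self._deadlines: Dict[object, float] = {}

    def on_client_line(self, line: str, now: Optional[float] = None) -> None:
        """Call for each line forwarded from the client to the child."""
        msg = _parse(line)
        # Requests carry both "method" and "id"; notifications have no "id"
        # and never get a response, so they must not start a timer.
        if msg is not None and "method" in msg and "id" in msg:
            now = time.monotonic() if now is None else now
            self._deadlines[msg["id"]] = now + self.timeout_s

    def on_server_line(self, line: str) -> None:
        """Call for each line forwarded from the child back to the client."""
        msg = _parse(line)
        # A response has an "id" but no "method"; it clears the timer.
        if msg is not None and "method" not in msg and "id" in msg:
            self._deadlines.pop(msg["id"], None)

    def stalled(self, now: Optional[float] = None) -> List[object]:
        """Return ids of requests whose deadline has passed."""
        now = time.monotonic() if now is None else now
        return [rid for rid, due in self._deadlines.items() if now >= due]


def timeout_error(request_id: object) -> str:
    """Build a JSON-RPC error line the shim could send to the client."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        # -32000 lies in the range JSON-RPC 2.0 reserves for
        # implementation-defined server errors; the exact code is an assumption.
        "error": {"code": -32000, "message": "request timed out in watchdog"},
    })
```

A shim built on this would call `on_client_line` and `on_server_line` while relaying stdio in both directions, poll `stalled()` periodically, and only then kill the child and answer each stalled id with `timeout_error`. An idle server with no pending requests is never touched, which is the difference from wrapping the process in `timeout 60`.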
### Added
- **`dot-watch` helper** (`/usr/local/bin/dot-watch`) — auto-rerenders a