feat(pi-ext): self-healing respawn + scoped init timeout for mempalace-mcp

A stall-kill (or any crash) of mempalace-mcp was a permanent latch:
available flipped off and stayed off until pi restart. Now the next tool
call transparently respawns the server and retries.

- ensureAlive(): bounded respawn with capped exponential backoff
  (MEMPALACE_MCP_MAX_RESPAWNS, default 2; MEMPALACE_MCP_RESPAWN_BACKOFF_MS,
  default 1000). Respawn budget resets on any successful JSON-RPC response,
  so a recovered server regains full patience while a persistently-broken
  one hits the cap and stays down (no hot-loop).
- Init timeout default raised 120000 -> 300000 (scoped to init only): a
  genuine virtiofs cold-open shouldn't be killed mid-progress only to
  respawn and re-pay the same cost. Per-call timeout stays 60000.
- Concurrency hardening: generation counter so a late exit from a killed
  old process can't tear down a fresh respawn; explicit healthy flag
  replaces racy proc!=null liveness check.
- README: document self-heal, new env vars, and why generous-init +
  bounded-respawn compose rather than overlap.
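
The generation-counter guard from the concurrency bullet can be sketched roughly as follows. This is a minimal illustration, not the extension's code: `Supervisor`, `FakeChild`, `handleExit` and the other names are hypothetical, and the real child-process plumbing is left out.

```typescript
// Hypothetical stand-in for a spawned child; not the extension's real type.
interface FakeChild {
  readonly generation: number;
}

class Supervisor {
  private generation = 0; // bumped on every spawn
  private child: FakeChild | null = null;
  private healthy = false; // explicit flag, set only by spawn/exit handling

  // Start a new child and return the generation it belongs to.
  spawn(): number {
    this.generation += 1;
    this.child = { generation: this.generation };
    this.healthy = true;
    return this.generation;
  }

  // Exit handler. Each child reports the generation captured when it was
  // spawned, so a late exit from a killed predecessor can be recognized.
  handleExit(exitedGeneration: number): boolean {
    if (exitedGeneration !== this.generation) {
      return false; // stale exit: leave the fresh child alone
    }
    this.child = null;
    this.healthy = false;
    return true;
  }

  isHealthy(): boolean {
    return this.healthy;
  }

  currentGeneration(): number | null {
    return this.child === null ? null : this.child.generation;
  }
}
```

In this sketch an exit event from the old process arrives tagged with its own generation, so it is dropped once a respawn has bumped the counter, and `isHealthy()` never has to infer liveness from whether a process handle happens to be non-null.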
pi
2026-06-26 00:22:21 +02:00
parent a3b8829991
commit e12b624cf7
2 changed files with 163 additions and 18 deletions
+32 -3
@@ -60,15 +60,44 @@ wedged server (classically: an OrbStack/virtiofs cold-open of a large
 *forever*, which freezes the pi TUI — ESC cancels the LLM stream, not a
 pending tool `execute()`. On timeout the extension rejects the request
 **and** kills the stalled child (SIGTERM→SIGKILL), so pi gets a clear
-error instead of hanging and later calls fail fast (`available` flips off;
-restart pi to retry). This is a per-REQUEST timeout, not a process-lifetime
+error instead of hanging. This is a per-REQUEST timeout, not a process-lifetime
 one — the long-lived server is only killed when a request genuinely stalls.
 - `MEMPALACE_MCP_TIMEOUT_MS` — tool-call/request timeout. Default `60000`.
 Kept short on purpose: a *query* taking this long is genuinely wedged.
 - `MEMPALACE_MCP_INIT_TIMEOUT_MS` — `initialize` + `tools/list` handshake
-timeout (cold-open is expected to be slower here). Default `120000`.
+timeout. Default `300000`. Deliberately generous: a genuine first
+cold-open over virtiofs can legitimately take minutes, and killing a
+still-progressing init only to respawn and re-pay the same cold cost is
+strictly worse than waiting.
 - Set either to `0` to disable (legacy unbounded behavior).
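
As an illustration of how the two timeouts might be resolved from the environment, here is a minimal sketch. The defaults and the "`0` disables" rule come from the text above; the helper names (`resolveTimeoutMs`, `resolveTimeouts`) and the choice to fall back to the default on unparsable or negative values are assumptions, not documented behavior.

```typescript
// Defaults as documented above.
const DEFAULT_REQUEST_TIMEOUT_MS = 60000;
const DEFAULT_INIT_TIMEOUT_MS = 300000;

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns a timeout in ms, or null for "no timeout".
// Assumption: unset, empty, unparsable, or negative values use the default.
function resolveTimeoutMs(raw: string | undefined, fallback: number): number | null {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed < 0) return fallback;
  return parsed === 0 ? null : parsed; // 0 disables the timeout
}

function resolveTimeouts(env: Env): { requestMs: number | null; initMs: number | null } {
  return {
    requestMs: resolveTimeoutMs(env["MEMPALACE_MCP_TIMEOUT_MS"], DEFAULT_REQUEST_TIMEOUT_MS),
    initMs: resolveTimeoutMs(env["MEMPALACE_MCP_INIT_TIMEOUT_MS"], DEFAULT_INIT_TIMEOUT_MS),
  };
}
```

Keeping the two values separate is what lets the init handshake stay generous while ordinary tool calls remain on the short budget.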
+### Self-heal (respawn instead of a permanent latch)
+A stall-kill (or any crash) used to be a **permanent** latch: `available`
+flipped off and stayed off until you restarted pi. It is now self-healing —
+the next tool call transparently respawns `mempalace-mcp` and retries.
+- Respawns use **capped exponential backoff** so a persistently-broken
+server can't hot-loop: `MEMPALACE_MCP_MAX_RESPAWNS` attempts (default
+`2`; set `0` to disable self-heal and keep the old fail-fast latch),
+with `MEMPALACE_MCP_RESPAWN_BACKOFF_MS` (default `1000`) doubled per
+attempt.
+- The budget **resets on any successful JSON-RPC response** — proof the
+server is actually live — so a server that recovers regains full
+patience, while one that keeps dying hits the cap and stays down (then
+restart pi).
+- Why the long init timeout and bounded respawn compose rather than
+overlap: once a server has opened the palace once, the OS page cache is
+warm, so respawn cold-opens are fast. The long init timeout prevents
+killing a healthy *first* cold-open; the respawn handles a genuinely
+dead server cheaply afterwards. (Note the HNSW deserialize is CPU work
+that isn't page-cacheable across spawns, which is exactly why we can't
+rely on respawn-warming alone and keep the generous init budget.)
+- The initial startup is tolerant too: if the very first `start()` fails,
+the extension runs the same bounded respawn before falling back to
+fail-soft (pi keeps working without palace tools).
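
The respawn budget described in this section can be sketched as a small counter. This is an illustrative sketch under assumptions, not the extension's implementation: the class and method names are hypothetical, the first respawn is assumed to wait the base backoff, and the "cap" is modeled as the attempt limit because the text does not mention a separate ceiling on the delay.

```typescript
class RespawnBudget {
  private used = 0; // respawns spent since the last successful response
  private readonly maxRespawns: number;
  private readonly baseBackoffMs: number;

  // Defaults mirror MEMPALACE_MCP_MAX_RESPAWNS and
  // MEMPALACE_MCP_RESPAWN_BACKOFF_MS as documented above.
  constructor(maxRespawns: number = 2, baseBackoffMs: number = 1000) {
    this.maxRespawns = maxRespawns;
    this.baseBackoffMs = baseBackoffMs;
  }

  // Delay to wait before the next respawn, or null once the budget is
  // exhausted (the server then stays down until pi is restarted).
  nextDelayMs(): number | null {
    if (this.used >= this.maxRespawns) return null;
    const delay = this.baseBackoffMs * 2 ** this.used; // doubled per attempt
    this.used += 1;
    return delay;
  }

  // Call on any successful JSON-RPC response: the server is proven live,
  // so it regains the full budget.
  recordSuccess(): void {
    this.used = 0;
  }
}
```

With the defaults, a server that keeps dying gets two respawns (after 1000 ms and 2000 ms in this sketch) and then stays down, while a single successful response in between restores both attempts. Constructing the budget with `0` respawns reproduces the old fail-fast latch.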
 ## Debugging
 - `MEMPALACE_EXT_DEBUG=1` — surface `mempalace-mcp` stderr into pi's