feat(pi-ext): self-healing respawn + scoped init timeout for mempalace-mcp

A stall-kill (or any crash) of mempalace-mcp was a permanent latch:
available flipped off and stayed off until pi restart. Now the next tool
call transparently respawns the server and retries.

- ensureAlive(): bounded respawn with capped exponential backoff
  (MEMPALACE_MCP_MAX_RESPAWNS, default 2;
  MEMPALACE_MCP_RESPAWN_BACKOFF_MS, default 1000). Respawn budget resets
  on any successful JSON-RPC response, so a recovered server regains full
  patience while a persistently-broken one hits the cap and stays down
  (no hot-loop).
- Init timeout default raised 120000 -> 300000 (scoped to init only): a
  genuine virtiofs cold-open shouldn't be killed mid-progress only to
  respawn and re-pay the same cost. Per-call timeout stays 60000.
- Concurrency hardening: generation counter so a late exit from a killed
  old process can't tear down a fresh respawn; explicit healthy flag
  replaces racy proc!=null liveness check.
- README: document self-heal, new env vars, and why generous-init +
  bounded-respawn compose rather than overlap.
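
The generation-counter idea from the concurrency-hardening item can be illustrated with a small, process-free sketch. The commit's actual code is not shown here, so `ServerSupervisor`, `spawn()`, and the returned exit handler are hypothetical names; only the generation counter and the explicit `healthy` flag come from the message above.

```typescript
// Hypothetical sketch: no real child process is started. spawn() only
// models the bookkeeping that a real spawn would perform.
class ServerSupervisor {
  private generation = 0;
  healthy = false;

  // Returns the exit handler that would be attached to the child process
  // created by this particular spawn.
  spawn(): () => void {
    const myGeneration = ++this.generation;
    this.healthy = true;
    return () => {
      // A late exit from an older, already-replaced process is ignored,
      // so it cannot tear down the fresh respawn.
      if (myGeneration !== this.generation) return;
      this.healthy = false;
    };
  }
}
```

The point of the explicit flag is that liveness is decided by the supervisor's own state rather than by whether a process handle happens to be non-null at the moment of the check.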
2026-06-26 00:22:21 +02:00
parent a3b8829991
commit e12b624cf7
2 changed files with 163 additions and 18 deletions
@@ -60,15 +60,44 @@ wedged server (classically: an OrbStack/virtiofs cold-open of a large
 *forever*, which freezes the pi TUI — ESC cancels the LLM stream, not a
 pending tool `execute()`. On timeout the extension rejects the request
 **and** kills the stalled child (SIGTERM→SIGKILL), so pi gets a clear
-error instead of hanging and later calls fail fast (`available` flips off;
-restart pi to retry). This is a per-REQUEST timeout, not a process-lifetime
+error instead of hanging. This is a per-REQUEST timeout, not a process-lifetime
 one — the long-lived server is only killed when a request genuinely stalls.

 - `MEMPALACE_MCP_TIMEOUT_MS` — tool-call/request timeout. Default `60000`.
+  Kept short on purpose: a *query* taking this long is genuinely wedged.
 - `MEMPALACE_MCP_INIT_TIMEOUT_MS` — `initialize` + `tools/list` handshake
-  timeout (cold-open is expected to be slower here). Default `120000`.
+  timeout. Default `300000`. Deliberately generous: a genuine first
+  cold-open over virtiofs can legitimately take minutes, and killing a
+  still-progressing init only to respawn and re-pay the same cold cost is
+  strictly worse than waiting.
 - Set either to `0` to disable (legacy unbounded behavior).
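
Both variables follow the same convention: a positive number of milliseconds, or `0` for no timeout. A minimal sketch of that resolution, assuming a hypothetical helper `resolveTimeoutMs` (the extension's real parsing is not shown here, and falling back to the default on unparseable or negative input is an assumption of this sketch):

```typescript
// Hypothetical helper, not the extension's actual code.
// Returns the timeout in milliseconds, or null when the timeout is disabled.
function resolveTimeoutMs(raw: string | undefined, defaultMs: number): number | null {
  if (raw === undefined || raw.trim() === "") return defaultMs;
  const parsed = Number(raw);
  // Assumption: unparseable or negative input falls back to the default.
  if (!Number.isFinite(parsed) || parsed < 0) return defaultMs;
  // 0 means "disabled" (legacy unbounded behavior).
  return parsed === 0 ? null : parsed;
}

// Defaults as documented in the list above.
const DEFAULT_TIMEOUT_MS = 60000;
const DEFAULT_INIT_TIMEOUT_MS = 300000;
```

In a Node.js process the first argument would typically be `process.env.MEMPALACE_MCP_TIMEOUT_MS` or `process.env.MEMPALACE_MCP_INIT_TIMEOUT_MS`.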

+### Self-heal (respawn instead of a permanent latch)
+
+A stall-kill (or any crash) used to be a **permanent** latch: `available`
+flipped off and stayed off until you restarted pi. It is now self-healing —
+the next tool call transparently respawns `mempalace-mcp` and retries.
+
+- Respawns use **capped exponential backoff** so a persistently broken
+  server can't hot-loop: at most `MEMPALACE_MCP_MAX_RESPAWNS` attempts
+  (default `2`; set `0` to disable self-heal and keep the old fail-fast
+  latch), with a `MEMPALACE_MCP_RESPAWN_BACKOFF_MS` delay (default `1000`)
+  that doubles per attempt.
+- The budget **resets on any successful JSON-RPC response** (proof the
+  server is actually live), so a server that recovers regains full
+  patience, while one that keeps dying hits the cap and stays down until
+  you restart pi.
+- Why the long init timeout and bounded respawn compose rather than
+  overlap: once a server has opened the palace once, the OS page cache is
+  warm, so respawn cold-opens are fast. The long init timeout prevents
+  killing a healthy *first* cold-open; the respawn handles a genuinely
+  dead server cheaply afterwards. (Note that HNSW deserialization is CPU
+  work that isn't page-cacheable across spawns, which is exactly why we
+  keep the generous init budget instead of relying on respawn-warming alone.)
+- The initial startup is tolerant too: if the very first `start()` fails,
+  the extension runs the same bounded respawn before falling back to
+  fail-soft (pi keeps working without palace tools).
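
The respawn budget described in this section can be sketched as a small counter. This is an illustration under stated assumptions, not the extension's code: `RespawnBudget`, `nextDelayMs()`, and `reset()` are hypothetical names, and the sketch assumes the first respawn waits the base delay before doubling. It bounds only the number of attempts; whether the real implementation also caps the delay itself is not stated here.

```typescript
// Hypothetical sketch of a bounded respawn budget; names are illustrative.
class RespawnBudget {
  private attempts = 0;
  private readonly maxRespawns: number;
  private readonly baseBackoffMs: number;

  // Defaults mirror MEMPALACE_MCP_MAX_RESPAWNS and
  // MEMPALACE_MCP_RESPAWN_BACKOFF_MS as documented above.
  constructor(maxRespawns = 2, baseBackoffMs = 1000) {
    this.maxRespawns = maxRespawns;
    this.baseBackoffMs = baseBackoffMs;
  }

  // Returns the delay to wait before the next respawn, or null once the
  // budget is exhausted (the caller then stays down: no hot-loop).
  nextDelayMs(): number | null {
    if (this.attempts >= this.maxRespawns) return null;
    // Assumption: the first attempt waits the base delay, then it doubles.
    const delay = this.baseBackoffMs * 2 ** this.attempts;
    this.attempts += 1;
    return delay;
  }

  // Call on any successful JSON-RPC response to restore full patience.
  reset(): void {
    this.attempts = 0;
  }
}
```

With the defaults, two consecutive failures yield delays of 1000 ms and 2000 ms, and a third request for a delay returns `null`. Constructing the budget with `0` attempts returns `null` immediately, which matches the documented way to keep the old fail-fast latch.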
+
 ## Debugging

 - `MEMPALACE_EXT_DEBUG=1` — surface `mempalace-mcp` stderr into pi's