feat(pi-ext): self-healing respawn + scoped init timeout for mempalace-mcp
A stall-kill (or any crash) of mempalace-mcp was a permanent latch: available flipped off and stayed off until pi restart. Now the next tool call transparently respawns the server and retries.

- ensureAlive(): bounded respawn with capped exponential backoff (MEMPALACE_MCP_MAX_RESPAWNS, default 2; MEMPALACE_MCP_RESPAWN_BACKOFF_MS, default 1000). Respawn budget resets on any successful JSON-RPC response, so a recovered server regains full patience while a persistently-broken one hits the cap and stays down (no hot-loop).
- Init timeout default raised 120000 -> 300000 (scoped to init only): a genuine virtiofs cold-open shouldn't be killed mid-progress only to respawn and re-pay the same cost. Per-call timeout stays 60000.
- Concurrency hardening: generation counter so a late exit from a killed old process can't tear down a fresh respawn; explicit healthy flag replaces racy proc!=null liveness check.
- README: document self-heal, new env vars, and why generous-init + bounded-respawn compose rather than overlap.
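To illustrate the concurrency hardening described above, here is a minimal sketch of a generation counter guarding an exit handler. It is not the extension's actual code: `Supervisor`, `FakeChild`, and every method name are hypothetical, and the child process is replaced by an in-memory stand-in so the example runs on its own.

```typescript
// Hypothetical stand-in for a spawned child process: it only models "exit".
class FakeChild {
  private handlers: Array<() => void> = [];
  onExit(handler: () => void): void {
    this.handlers.push(handler);
  }
  emitExit(): void {
    for (const handler of this.handlers) handler();
  }
}

// Hypothetical supervisor: tracks the live child with an explicit healthy
// flag instead of inferring liveness from "current child is not null".
class Supervisor {
  private generation = 0;
  private healthy = false;
  private current: FakeChild | null = null;

  spawn(): FakeChild {
    // Each spawn gets its own generation number, captured by the closure.
    const myGeneration = ++this.generation;
    const child = new FakeChild();
    child.onExit(() => {
      // A late exit from an older, already-replaced child must not tear
      // down the fresh one, so stale generations are ignored.
      if (myGeneration !== this.generation) return;
      this.healthy = false;
      this.current = null;
    });
    this.current = child;
    // Simplification: a real server would only count as healthy after its
    // init handshake succeeds.
    this.healthy = true;
    return child;
  }

  isHealthy(): boolean {
    return this.healthy;
  }

  currentChild(): FakeChild | null {
    return this.current;
  }
}
```

In this sketch, an exit event from the first child that arrives after a second `spawn()` leaves the supervisor untouched, while an exit from the current child still marks it unhealthy.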
@@ -60,15 +60,44 @@ wedged server (classically: an OrbStack/virtiofs cold-open of a large
 *forever*, which freezes the pi TUI — ESC cancels the LLM stream, not a
 pending tool `execute()`. On timeout the extension rejects the request
 **and** kills the stalled child (SIGTERM→SIGKILL), so pi gets a clear
-error instead of hanging and later calls fail fast (`available` flips off;
-restart pi to retry). This is a per-REQUEST timeout, not a process-lifetime
+error instead of hanging. This is a per-REQUEST timeout, not a process-lifetime
 one — the long-lived server is only killed when a request genuinely stalls.

 - `MEMPALACE_MCP_TIMEOUT_MS` — tool-call/request timeout. Default `60000`.
 Kept short on purpose: a *query* taking this long is genuinely wedged.
 - `MEMPALACE_MCP_INIT_TIMEOUT_MS` — `initialize` + `tools/list` handshake
-timeout (cold-open is expected to be slower here). Default `120000`.
+timeout. Default `300000`. Deliberately generous: a genuine first
+cold-open over virtiofs can legitimately take minutes, and killing a
+still-progressing init only to respawn and re-pay the same cold cost is
+strictly worse than waiting.
 - Set either to `0` to disable (legacy unbounded behavior).

+### Self-heal (respawn instead of a permanent latch)
+
+A stall-kill (or any crash) used to be a **permanent** latch: `available`
+flipped off and stayed off until you restarted pi. It is now self-healing —
+the next tool call transparently respawns `mempalace-mcp` and retries.
+
+- Respawns use **capped exponential backoff** so a persistently-broken
+server can't hot-loop: `MEMPALACE_MCP_MAX_RESPAWNS` attempts (default
+`2`; set `0` to disable self-heal and keep the old fail-fast latch),
+with `MEMPALACE_MCP_RESPAWN_BACKOFF_MS` (default `1000`) doubled per
+attempt.
+- The budget **resets on any successful JSON-RPC response** — proof the
+server is actually live — so a server that recovers regains full
+patience, while one that keeps dying hits the cap and stays down (then
+restart pi).
+- Why the long init timeout and bounded respawn compose rather than
+overlap: once a server has opened the palace once, the OS page cache is
+warm, so respawn cold-opens are fast. The long init timeout prevents
+killing a healthy *first* cold-open; the respawn handles a genuinely
+dead server cheaply afterwards. (Note the HNSW deserialize is CPU work
+that isn't page-cacheable across spawns, which is exactly why we can't
+rely on respawn-warming alone and keep the generous init budget.)
+- The initial startup is tolerant too: if the very first `start()` fails,
+the extension runs the same bounded respawn before falling back to
+fail-soft (pi keeps working without palace tools).
+
 ## Debugging

 - `MEMPALACE_EXT_DEBUG=1` — surface `mempalace-mcp` stderr into pi's
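The respawn rules in the README text above (bounded attempts, doubling backoff, budget reset on success) can be sketched roughly as follows. This is an illustration under stated assumptions, not the extension's implementation: `RespawnBudget`, `budgetFromEnv`, and `readNonNegativeInt` are hypothetical names, the 30-second cap is an assumed value (the source says the backoff is capped but does not give the cap), and the sketch assumes the first respawn waits the base delay before doubling starts.

```typescript
// Assumed ceiling for a single backoff delay; the real cap is not stated.
const ASSUMED_BACKOFF_CAP_MS = 30_000;

// Parse a non-negative integer from an env-style string, else use the fallback.
function readNonNegativeInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n >= 0 ? n : fallback;
}

class RespawnBudget {
  private used = 0;
  private readonly maxRespawns: number;
  private readonly baseBackoffMs: number;
  private readonly capMs: number;

  constructor(maxRespawns: number, baseBackoffMs: number, capMs: number = ASSUMED_BACKOFF_CAP_MS) {
    this.maxRespawns = maxRespawns;
    this.baseBackoffMs = baseBackoffMs;
    this.capMs = capMs;
  }

  // Delay to wait before the next respawn, or null once the budget is spent.
  // With maxRespawns = 0 this is always null, i.e. the old fail-fast latch.
  nextDelayMs(): number | null {
    if (this.used >= this.maxRespawns) return null;
    const delay = Math.min(this.baseBackoffMs * 2 ** this.used, this.capMs);
    this.used += 1;
    return delay;
  }

  // Call on any successful JSON-RPC response: the server is demonstrably
  // alive, so it regains its full respawn budget.
  noteSuccess(): void {
    this.used = 0;
  }
}

// Build a budget from an env-like record, using the documented defaults.
function budgetFromEnv(env: Record<string, string | undefined>): RespawnBudget {
  return new RespawnBudget(
    readNonNegativeInt(env["MEMPALACE_MCP_MAX_RESPAWNS"], 2),
    readNonNegativeInt(env["MEMPALACE_MCP_RESPAWN_BACKOFF_MS"], 1000),
  );
}
```

With the documented defaults, this sketch allows two respawns before reporting the budget as exhausted, and a single successful response restores both.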