From b17dc1fa1f52c26ab0957f9f7be65a3b9189a9ca Mon Sep 17 00:00:00 2001
From: pi <pi@devbox>
Date: Sun, 14 Jun 2026 18:06:47 +0200
Subject: [PATCH] docs: add single-writer MemPalace broker design (RFC, queue
 #4)

---
 docs/mempalace-broker-design.md | 243 ++++++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 docs/mempalace-broker-design.md

diff --git a/docs/mempalace-broker-design.md b/docs/mempalace-broker-design.md
new file mode 100644
index 0000000..aea6dc9
--- /dev/null
+++ b/docs/mempalace-broker-design.md
@@ -0,0 +1,243 @@
+# Design: single-writer MemPalace broker (cross-host serialization)
+
+> **Status:** DRAFT / RFC — not yet implemented. Captures the design so it can be
+> picked up later. Authored 2026-06-14.
+> **Owner:** unassigned. **Tracking:** queue item #4 ("host-side mempalace-mcp
+> daemon over a UNIX/shared socket").
+
+## Problem
+
+The pi-devbox container's `~/.mempalace` (`/home/developer/.mempalace`) is a
+**virtiofs bind-mount of the host's `/Users/joakim/.mempalace`** (verified
+2026-06-14 via `/proc/mounts`: `mac /home/developer/.mempalace virtiofs rw`).
+Container pi and host-native pi therefore **read and write ONE shared palace** —
+full memory parity already exists; nothing needs to be built to *enable* sharing.
+
+The actual hazard is the opposite of sharing: **concurrency**. Two pi processes
+(one native on the host, one in the container) can open the same
+`chroma.sqlite3` / `knowledge_graph.sqlite3` and write at the same time. The
+palace directory already shows the scars of this:
+
+- `chroma.sqlite3.broken-20260505`
+- many `*.corrupt-20260528`
+- a long run of `*.drift-2026*`
+- `locks/` with `mine_palace_*.lock` files, including a **stale** one.
+
+These are mempalace's defensive lock + auto-snapshot/repair machinery firing
+under concurrent access.
+
+### Why a shared lock file is NOT sufficient
+
+The container runs inside a Linux VM (OrbStack / Docker Desktop on macOS); the
+palace bytes live on the macOS host, surfaced into the VM via virtiofs.
+Consequences:
+
+- A **UNIX-domain socket file** visible at `~/.mempalace/broker.sock` inside the
+  container is a *host-kernel* object. The container's kernel can see the inode
+  but **cannot connect to it** across the VM boundary.
+- **flock / advisory lockfiles are not coherent across the host↔VM boundary.**
+  A lock taken on the host is not reliably seen in the container and vice-versa.
+  (The stale `mine_palace_*.lock` is direct evidence the existing lock scheme is
+  not bulletproof across this boundary.)
+
+**Therefore the only trustworthy serialization is to route every write through a
+single process.** That single process is the broker. The design question is *not*
+"how do we lock" — it's "**where does the one writer live, and how does every pi
+(host or container) reach it across the VM boundary?**"
+
+## Goals
+
+1. Exactly one process opens the palace SQLite files at any time (single writer;
+   concurrent reads within that process are fine).
+2. Works in all three topologies on a given host:
+   - native pi only,
+   - native pi + container pi,
+   - container pi only.
+3. pi configuration is **identical** in every topology (no per-environment MCP
+   config divergence).
+4. No new corruption pathway introduced; degrade safely when the broker is
+   genuinely unreachable and there are no peers.
+
+### Non-goals (for this iteration)
+
+- opencode / opencode-devbox co-existence (see "Co-existence with opencode"
+  below — deferred until the pi case is solved).
+- Multi-host palace replication. This is about one host's local palace.
+- Changing mempalace's on-disk format or its public MCP tool surface.
+
+## Architecture
+
+```
+pi (host)  ─stdio─►  mp-shim ─┐
+                              ├─►  mempalace-broker  ─►  chroma.sqlite3
+pi (ctr)   ─stdio─►  mp-shim ─┘     (SINGLE owner;        knowledge_graph.sqlite3
+                                    serialized writer,    + in-memory HNSW index
+                                    concurrent readers)
+```
+
+### `mempalace-broker`
+
+A long-lived process that is the **only** opener of the palace SQLite files. It:
+
+- runs the real mempalace engine,
+- holds the HNSW index in memory,
+- pushes all mutations through a single writer queue (reads may fan out),
+- exposes the mempalace MCP JSON-RPC surface over one or more transports,
+- is the canonical owner of palace state for the lifetime of the host session.
+
+**Bonus:** a single always-resident owner also eliminates the stale-HNSW-index
+problem that `mempalace_reconnect` exists to work around: no external writer
+remains that could desync the on-disk state from the in-memory index.
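+
+A minimal sketch of the single writer queue, assuming a thread-based broker.
+`SingleWriter` and `submit` are hypothetical names, not mempalace APIs; each
+submitted callable stands in for whatever engine call performs the mutation.
+
+```python
+import queue
+import threading
+from concurrent.futures import Future
+from typing import Any, Callable
+
+
+class SingleWriter:
+    """Run every mutation on one dedicated thread, in submission order."""
+
+    def __init__(self) -> None:
+        self._jobs: queue.Queue = queue.Queue()
+        self._thread = threading.Thread(target=self._run, daemon=True)
+        self._thread.start()
+
+    def submit(self, mutation: Callable[[], Any]) -> Future:
+        """Enqueue a mutation; the Future resolves once it has been applied."""
+        done: Future = Future()
+        self._jobs.put((mutation, done))
+        return done
+
+    def close(self) -> None:
+        self._jobs.put(None)  # sentinel: earlier jobs drain first, then stop
+        self._thread.join()
+
+    def _run(self) -> None:
+        while (job := self._jobs.get()) is not None:
+            mutation, done = job
+            try:
+                done.set_result(mutation())
+            except Exception as exc:  # report to the caller, keep serving
+                done.set_exception(exc)
+```
+
+Both listeners would call `submit()` on the same instance, so a write arriving
+over AF_UNIX and one arriving over the forwarded socket can never interleave.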
+
+### `mp-shim`
+
+A tiny stdio↔transport adapter. pi's mempalace MCP config points at the shim
+**everywhere, unchanged**. pi still believes it is speaking stdio MCP to a local
+server; the shim forwards JSON-RPC to the broker over whichever transport is
+available, and handles all discovery / startup / election complexity. Keeping
+pi's config identical across topologies is a hard requirement (goal #3) and the
+shim is what makes it possible.
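+
+A sketch of the shim's forwarding core, under the assumption that the broker
+socket carries the same newline-delimited JSON-RPC framing as MCP's stdio
+transport (the broker's wire format is not decided in this RFC). `pump` and
+`run_shim` are hypothetical names; discovery and election are left out.
+
+```python
+import socket
+import threading
+from typing import BinaryIO
+
+
+def pump(src: BinaryIO, dst: BinaryIO) -> None:
+    """Copy newline-delimited messages from src to dst until src hits EOF."""
+    for line in src:
+        dst.write(line)
+        dst.flush()  # the peer blocks on each message, so never buffer one
+
+
+def run_shim(stdin: BinaryIO, stdout: BinaryIO, broker: socket.socket) -> None:
+    """Bridge pi's stdio to an already-connected broker socket."""
+    to_broker = broker.makefile("wb")
+    from_broker = broker.makefile("rb")
+    replies = threading.Thread(target=pump, args=(from_broker, stdout), daemon=True)
+    replies.start()
+    pump(stdin, to_broker)           # returns when pi closes stdin
+    broker.shutdown(socket.SHUT_WR)  # signal that no more requests follow
+    replies.join()                   # drain replies still in flight
+```
+
+In a real shim the call would be roughly
+`run_shim(sys.stdin.buffer, sys.stdout.buffer, sock)`, with `sock` connected to
+whichever transport discovery selected.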
+
+## Canonical owner = the host
+
+The broker's home is **always the host**, because:
+
+1. The palace bytes physically live there (`/Users/joakim/.mempalace`).
+2. The host outlives any container — ownership does not evaporate on
+   `docker compose down`.
+3. Containers already have a route back to it (`host.docker.internal` and the
+   verified dssh ControlMaster bridge).
+
+The broker binds **two listeners feeding one queue**:
+
+- **AF_UNIX** at `$MEMPALACE_PATH/broker.sock` — for host-native pi (fast,
+  filesystem-perms-secured).
+- a **cross-boundary** transport for container clients (below).
+
+## Transport matrix
+
+| Topology | Broker runs on | Host pi reaches it via | Container pi reaches it via |
+|---|---|---|---|
+| native only | host | AF_UNIX socket | — |
+| native + container | host | AF_UNIX socket | SSH-forwarded socket (preferred) or TCP |
+| container only | host (started via bridge) | — | SSH-forwarded socket or TCP |
+
+### Cross-boundary transport options
+
+**(a) SSH-forwarded UNIX socket over the existing dssh ControlMaster — PREFERRED.**
+The container's `setup-lan-access.sh` already establishes a ControlMaster to the
+host with `ControlPersist 4h`. The container shim forwards the host broker socket
+over that master:
+
+```
+ssh -F ~/.ssh-local/config -N \
+    -L "$XDG_RUNTIME_DIR/mp.sock:/Users/joakim/.mempalace/broker.sock" host
+```
+
+then connects to the local forwarded socket. Auth = SSH key; nothing is
+LAN-exposed; no extra shared secret needed; rides the persistent master so setup
+cost is near-zero. Most portable across non-OrbStack hosts.
+
+**(b) TCP on `host.docker.internal:PORT` — fallback.** Simpler, but the broker
+must bind a routable interface (not just `127.0.0.1`), which requires a
+**shared-secret token** to prevent other local/LAN processes from talking to it.
+The token is written to `broker.json` in the virtiofs-mounted palace dir
+(readable from both sides). More care required to get the bind + auth right.
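+
+A small sketch of the token handling for this fallback. `new_token` and
+`token_ok` are hypothetical helpers; how the client presents the token (for
+example as the first line on the connection) is an assumption this RFC leaves
+open.
+
+```python
+import hmac
+import secrets
+
+
+def new_token() -> str:
+    """Generate a fresh shared secret (32 random bytes, URL-safe text)."""
+    return secrets.token_urlsafe(32)
+
+
+def token_ok(expected: str, presented: str) -> bool:
+    """Compare in constant time so timing does not leak a partial match."""
+    return hmac.compare_digest(expected.encode(), presented.encode())
+```
+
+The broker would generate the token once at startup, record it in
+`broker.json`, and drop any TCP connection whose token fails `token_ok`.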
+
+## Discovery + on-demand start (the shim's algorithm)
+
+Run by the shim on every pi session start, so it is correct regardless of who is
+already running:
+
+```
+1. If $MEMPALACE_BROKER is set        → use it verbatim (escape hatch).
+2. Read $MEMPALACE_PATH/broker.json   → endpoint + pid + token.
+   Try to connect (UNIX if host; forwarded-sock / TCP if container).
+   If connected & healthy             → done.
+3. Broker not reachable → START IT:
+   - On host:      flock($MEMPALACE_PATH/broker.lock, non-blocking)
+                     win  → exec broker, wait for broker.json, connect.
+                     lose → someone else is starting it; backoff + retry connect.
+   - In container: run `ssh host 'mempalace-broker --ensure'` (idempotent;
+                   performs the SAME flock election ON THE HOST), then forward +
+                   connect.
+4. Last-resort fallback (no broker, cannot start one):
+   open the palace DIRECTLY — but ONLY after asserting this process is the sole
+   writer (no other live broker/pid recorded in broker.json). Degrades to
+   today's behaviour for the genuinely-alone case; never used when a broker
+   exists.
+```
+
+**Key trick:** host-side election uses `flock` on the host, where contenders
+share one kernel and the lock is coherent. The cross-boundary case **never relies
+on cross-VM locking**; it relies on `ssh host 'mempalace-broker --ensure'`, which
+runs the election on the host. That makes the design topology-independent.
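+
+The host-side election in step 3 can be sketched with Python's `fcntl.flock`.
+`try_win_election` is a hypothetical helper name; the lock path is the
+`broker.lock` file named in the algorithm.
+
+```python
+import fcntl
+import os
+from typing import Optional
+
+
+def try_win_election(lock_path: str) -> Optional[int]:
+    """Return an fd holding the exclusive lock, or None if it is already held."""
+    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
+    try:
+        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
+    except BlockingIOError:
+        os.close(fd)
+        return None  # lost: another process is starting or running the broker
+    return fd  # won: the lock lasts only as long as this fd stays open
+```
+
+Python opens descriptors as non-inheritable, so a shim that wins and then
+`exec`s the broker would need `os.set_inheritable(fd, True)` first; the
+alternative is for the broker to take the lock itself.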
+
+### Lifecycle
+
+- Broker writes `broker.json` (endpoint + pid + token) **atomically** after
+  binding.
+- Broker holds `broker.lock` for its entire lifetime → at most one host broker.
+- Idle-exit after N minutes with no connected clients; the next client
+  re-elects. (Or keep-alive; idle-exit is friendlier on resources.)
+- Clients discard a stale `broker.json` (recorded pid dead) and re-run the election.
+- Clients retry with backoff while a broker is mid-startup.
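+
+A sketch of the atomic publish and the stale-pid probe. The `broker.json` schema
+is still an open question, so the dict passed in is illustrative only, and
+`write_broker_json` and `pid_alive` are hypothetical names.
+
+```python
+import json
+import os
+import tempfile
+
+
+def write_broker_json(path: str, info: dict) -> None:
+    """Publish metadata so readers see the old file or the new one, never half."""
+    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", prefix=".broker.")
+    try:
+        with os.fdopen(fd, "w") as f:  # mkstemp creates the file with mode 0600
+            json.dump(info, f)
+            f.flush()
+            os.fsync(f.fileno())
+        os.replace(tmp, path)  # rename within one directory replaces atomically
+    except BaseException:
+        os.unlink(tmp)
+        raise
+
+
+def pid_alive(pid: int) -> bool:
+    """Probe a pid with signal 0, which checks existence without signalling."""
+    try:
+        os.kill(pid, 0)
+    except ProcessLookupError:
+        return False
+    except PermissionError:
+        return True  # the process exists but belongs to another user
+    return True
+```
+
+The pid probe only means something on the host: a container shim runs in a
+different pid namespace and cannot see host pids, so it would have to judge
+staleness by whether a connect plus health check succeeds.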
+
+## The genuinely hard case
+
+**Container-only with no SSH bridge configured** (e.g. plain Linux Docker,
+`HOST_SSH_USER` unset, no `host.docker.internal`). The container cannot start or
+reach a host broker. Options, none free:
+
+1. **Require the bridge** for multi-writer container setups, and document it as a
+   precondition. Reasonable: pi-devbox already ships `setup-lan-access.sh` and
+   the bridge is the supported path.
+2. **Run the broker inside the container**, publishing a Docker port the host can
+   later reach. Works, but inverts ownership and the broker dies with the
+   container — only acceptable if containers are the *sole* writers on that host.
+3. **Accept degraded mode** (algorithm step 4): a lone container with no peers
+   has no concurrency, so direct access is safe *as long as* nothing else opens
+   the palace concurrently. The host shim also checks `broker.json` before
+   opening directly, so a later host pi will not silently start a second
+   uncoordinated writer.
+
+**Summary:** fully robust for native-only, native+container, and
+container-only-with-bridge. The only residual sharp edge is container-only
+*without* a bridge *and* a future concurrent host writer — intrinsic (no shared
+coherent lock exists across that boundary), best handled by mandating the bridge
+rather than pretending file locks work.
+
+## Co-existence with opencode / opencode-devbox (DEFERRED — context only)
+
+The palace is shared by more than pi. opencode (native) and opencode-devbox
+(container) also write to the same `~/.mempalace`. **Assumption to verify:**
+opencode sessions write to **different wings** than pi sessions (pi uses
+`wing_pi`, diaries per-agent, etc.), so cross-tool intermixing into the *same*
+destination may be a non-issue at the application level.
+
+However, the corruption risk here is at the **SQLite-file level, not the wing
+level** — two processes writing different wings of the *same* `chroma.sqlite3`
+concurrently is still a concurrent write to one file. So the broker, once it
+exists, is the right serialization point for opencode too: opencode's mempalace
+client would route through the same broker via the same shim mechanism.
+
+**Decision:** do not design for opencode co-existence yet. Resolve the pi case
+first; then revisit whether opencode clients adopt the same shim. The residual
+risk in the interim is native + container *opencode* sessions writing the same
+palace simultaneously — explicitly deferred ("cross that bridge later").
+
+## Open questions / TODO before implementation
+
+- Does the mempalace engine expose an embeddable entrypoint suitable for running
+  inside a long-lived broker, or does the broker wrap the existing MCP server
+  binary and multiplex stdio clients onto it? (Affects whether reads can truly
+  fan out or are also serialized.)
+- Idle-exit timeout default + whether to expose it via env.
+- `broker.json` schema + atomic-write + stale-pid-reclaim details.
+- TCP-path token handling and safe bind interface selection on Linux Docker
+  (`--add-host=host.docker.internal:host-gateway`).
+- Where the broker binary ships: baked into `Dockerfile.base`? host install via
+  pi-toolkit / mempalace-toolkit? Both, since both sides need the shim and the
+  host needs the broker.
+- Smoke-test plan: prove single-writer invariant under a deliberate concurrent
+  host+container write storm (should produce zero `.corrupt`/`.drift` snapshots).