# Design: single-writer MemPalace broker (cross-host serialization)

> **Status:** DRAFT / RFC, not yet implemented. Captures the design so it can be
> picked up later. Authored 2026-06-14.
> **Owner:** unassigned. **Tracking:** queue item #4 ("host-side mempalace-mcp
> daemon over a UNIX/shared socket").

## Problem

The pi-devbox container's `~/.mempalace` (`/home/developer/.mempalace`) is a
**virtiofs bind-mount of the host's `/Users/joakim/.mempalace`** (verified
2026-06-14 via `/proc/mounts`: `mac /home/developer/.mempalace virtiofs rw`).
Container pi and host-native pi therefore **read and write ONE shared palace**:
full memory parity already exists; nothing needs to be built to *enable* sharing.

The actual hazard is the opposite of sharing: **concurrency**. Two pi processes
(one native on the host, one in the container) can open the same
`chroma.sqlite3` / `knowledge_graph.sqlite3` and write at the same time. The
palace directory already shows the scars of this:

- `chroma.sqlite3.broken-20260505`
- many `*.corrupt-20260528`
- a long run of `*.drift-2026*`
- `locks/` with `mine_palace_*.lock` files, including a **stale** one.

These are mempalace's defensive lock + auto-snapshot/repair machinery firing
under concurrent access.

### Why a shared lock file is NOT sufficient

The container runs inside a Linux VM (OrbStack / Docker Desktop on macOS); the
palace bytes live on the macOS host, surfaced into the VM via virtiofs.
Consequences:

- A **UNIX-domain socket file** visible at `~/.mempalace/broker.sock` inside the
  container is a *host-kernel* object. The container's kernel can see the inode
  but **cannot connect to it** across the VM boundary.
- **flock / advisory lockfiles are not coherent across the host↔VM boundary.**
  A lock taken on the host is not reliably seen in the container and vice-versa.
  (The stale `mine_palace_*.lock` is direct evidence the existing lock scheme is
  not bulletproof across this boundary.)

**Therefore the only trustworthy serialization is to route every write through
a single process.** That single process is the broker. The design question is
*not* "how do we lock"; it is "**where does the one writer live, and how does
every pi (host or container) reach it across the VM boundary?**"

## Goals

1. Exactly one process opens the palace SQLite files at any time (single
   writer; concurrent reads are fine).
2. Works in all three topologies on a given host:
   - native pi only,
   - native pi + container pi,
   - container pi only.
3. pi configuration is **identical** in every topology (no per-environment MCP
   config divergence).
4. No new corruption pathway introduced; degrade safely when the broker is
   genuinely unreachable and there are no peers.

### Non-goals (for this iteration)

- opencode / opencode-devbox co-existence (see "Co-existence with opencode"
  below; deferred until the pi case is solved).
- Multi-host palace replication. This is about one host's local palace.
- Changing mempalace's on-disk format or its public MCP tool surface.

## Architecture

```
pi (host) ─stdio─► mp-shim ─┐
                            ├─► mempalace-broker ─► chroma.sqlite3
pi (ctr)  ─stdio─► mp-shim ─┘   (SINGLE owner;      knowledge_graph.sqlite3
                                 serialized writer, + in-memory HNSW index
                                 concurrent readers)
```

### `mempalace-broker`

A long-lived process that is the **only** opener of the palace SQLite files. It:

- runs the real mempalace engine,
- holds the HNSW index in memory,
- pushes all mutations through a single writer queue (reads may fan out),
- exposes the mempalace MCP JSON-RPC surface over one or more transports,
- is the canonical owner of palace state for the lifetime of the host session.

**Bonus:** a single always-resident owner also eliminates the stale-HNSW-index
problem that `mempalace_reconnect` exists to work around: there is never an
external writer to desync the in-memory index against.

### `mp-shim`

A tiny stdio↔transport adapter.
pi's mempalace MCP config points at the shim **everywhere, unchanged**. pi
still believes it is speaking stdio MCP to a local server; the shim forwards
JSON-RPC to the broker over whichever transport is available, and handles all
discovery / startup / election complexity. Keeping pi's config identical across
topologies is a hard requirement (goal #3) and the shim is what makes it
possible.

## Canonical owner = the host

The broker's home is **always the host**, because:

1. The palace bytes physically live there (`/Users/joakim/.mempalace`).
2. The host outlives any container: ownership does not evaporate on
   `docker compose down`.
3. Containers already have a route back to it (`host.docker.internal` and the
   verified dssh ControlMaster bridge).

The broker binds **two listeners feeding one queue**:

- **AF_UNIX** at `$MEMPALACE_PATH/broker.sock` for host-native pi (fast,
  filesystem-perms-secured).
- a **cross-boundary** transport for container clients (below).

## Transport matrix

| Topology | Broker runs on | Host pi reaches it via | Container pi reaches it via |
|---|---|---|---|
| native only | host | AF_UNIX socket | n/a |
| native + container | host | AF_UNIX socket | SSH-forwarded socket (preferred) or TCP |
| container only | host (started via bridge) | n/a | SSH-forwarded socket or TCP |

### Cross-boundary transport options

**(a) SSH-forwarded UNIX socket over the existing dssh ControlMaster (PREFERRED).**
The container's `setup-lan-access.sh` already establishes a ControlMaster to
the host with `ControlPersist 4h`. The container shim forwards the host broker
socket over that master:

```
# -N: forward only, run no remote command. The right-hand path is resolved on
# the HOST, so it must be the host-side palace path, not the container's $HOME.
ssh -N -F ~/.ssh-local/config \
    -L "$XDG_RUNTIME_DIR/mp.sock:/Users/joakim/.mempalace/broker.sock" host
```

then connects to the local forwarded socket. Auth = SSH key; nothing is
LAN-exposed; no extra shared secret needed; rides the persistent master so
setup cost is near-zero. Most portable across non-OrbStack hosts.

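
The last step of option (a), connecting to the local forwarded socket, might look roughly like this from the shim's side. This is a minimal sketch, not the shim's actual implementation: the helper name `rpc_call` is hypothetical, and newline-delimited JSON framing on the broker socket is an assumption (this design does not yet fix a wire format).

```python
import json
import socket


def rpc_call(sock_path: str, method: str, params: dict, timeout: float = 5.0) -> dict:
    """Send one JSON-RPC 2.0 request over a UNIX socket and return the reply.

    Assumes (hypothetically) one JSON document per line in each direction.
    """
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(json.dumps(request).encode("utf-8") + b"\n")

        # Accumulate bytes until the newline that terminates the reply.
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = sock.recv(65536)
            if not chunk:  # peer closed the connection
                break
            buf += chunk

    if not buf:
        raise ConnectionError("broker closed the connection without replying")
    return json.loads(buf)
```

In the container, `sock_path` would be the forwarded `$XDG_RUNTIME_DIR/mp.sock`; on the host it would be `broker.sock` itself, so the same client code can serve both sides.
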
**(b) TCP on `host.docker.internal:PORT` (fallback).**
Simpler, but the broker must bind a routable interface (not just `127.0.0.1`),
which requires a **shared-secret token** to prevent other local/LAN processes
from talking to it. The token is written to `broker.json` in the
virtiofs-mounted palace dir (readable from both sides). More care required to
get the bind + auth right.

## Discovery + on-demand start (the shim's algorithm)

Run by the shim on every pi session start, so it is correct regardless of who
is already running:

```
1. If $MEMPALACE_BROKER is set → use it verbatim (escape hatch).
2. Read $MEMPALACE_PATH/broker.json → endpoint + pid + token.
   Try to connect (UNIX if host; forwarded-sock / TCP if container).
   If connected & healthy → done.
3. Broker not reachable → START IT:
   - On host: flock($MEMPALACE_PATH/broker.lock, non-blocking)
       win  → exec broker, wait for broker.json, connect.
       lose → someone else is starting it; backoff + retry connect.
   - In container: run `ssh host 'mempalace-broker --ensure'`
       (idempotent; performs the SAME flock election ON THE HOST),
       then forward + connect.
4. Last-resort fallback (no broker, cannot start one):
   open the palace DIRECTLY, but ONLY after asserting this process is the
   sole writer (no other live broker/pid recorded in broker.json).
   Degrades to today's behaviour for the genuinely-alone case; never used
   when a broker exists.
```

**Key trick:** host-side election uses `flock` on the host, where it is
coherent (same kernel) and therefore bulletproof. The cross-boundary case
**never relies on cross-VM locking**; it relies on
`ssh host 'mempalace-broker --ensure'`, which runs the election on the host
where flock works. That is what makes the design topology-independent.

### Lifecycle

- Broker writes `broker.json` (endpoint + pid + token) **atomically** after
  binding.
- Broker holds `broker.lock` for its entire lifetime → at most one host broker.
- Idle-exit after N minutes with no connected clients; the next client re-elects.
  (Or keep-alive; idle-exit is friendlier on resources.)
- Clients reclaim a stale lock if the pid recorded in `broker.json` is dead.
- Clients retry with backoff while a broker is mid-startup.

## Engine vs. shim: what the image must still ship

The component bundled in the images today is really **two separable pieces**:

- the **mempalace engine**, which opens the SQLite files, computes embeddings,
  and owns the HNSW index (the heavy part: chromadb, embedding model, etc.), and
- the thin client surface pi actually talks to.

In the brokered design these split cleanly:

- the **broker** is the only thing that runs the *engine*;
- the **shim** is **engine-free**: it just forwards MCP JSON-RPC. It needs no
  chromadb, no embedding model, no heavy deps. Embeddings/search happen
  broker-side. (Potential image-slimming opportunity, though see below for why
  we keep the engine bundled anyway.)

Whether the bundled engine is "used as-is" or merely fronted by the broker
**depends on who owns the broker**:

**A) Host runs the broker (native, or native+container; the common case).**
The *host's* engine is authoritative and used as-is. The broker is purely an
intermediate step so writes can't collide; the host engine does the read/write.
The container's **bundled engine is dormant**: the container uses only its
shim to reach the host broker. The engine in the image is not needed for this
path.

**B) Container lands on a host with no mempalace (fresh-host case).**
The bundled engine earns its keep: you cannot conjure an engine onto the host
without installing one. Either the container runs the broker *itself*
(in-container ownership, bundled engine used as-is) or it falls back to
degraded direct mode (single writer, bundled engine used directly).

**Decision: keep shipping the engine in the images**, but for three specific
reasons, not because the brokered path needs it:

1. **Self-containedness**: pi-devbox's promise is "works on any host."
   A container with no memory unless the host pre-installed mempalace breaks
   that, especially for the Docker Hub audience.
2. **Fresh-host bootstrap** (case B): no host engine to borrow.
3. **Degraded fallback**: the no-broker-reachable path opens the DB locally and
   needs the engine present.

In the host-managed common case the bundled engine is just dormant insurance;
the shim is the only piece the container actively uses.

### Version-coherence note

Because **only the broker's engine ever writes**, its version defines the
on-disk format. Host-vs-bundled engine version skew is therefore **harmless in
the brokered path** (only one engine ever touches the bytes). Skew only bites
in **degraded direct mode**, where the container writes with a
possibly-different engine version than the host would. This argues for the
broker pinning/owning the authoritative engine version and treating the bundled
engine as fallback-only.

> Partially resolves the "where the broker binary ships" open question below:
> the **shim** must ship on both sides; the **engine** must ship on the host
> (to run the broker) and stays bundled in the image as fallback/bootstrap
> insurance, not as the authoritative writer in the common case.

## The genuinely hard case

**Container-only with no SSH bridge configured** (e.g. plain Linux Docker,
`HOST_SSH_USER` unset, no `host.docker.internal`). The container cannot start
or reach a host broker. Options, none free:

1. **Require the bridge** for multi-writer container setups, and document it as
   a precondition. Reasonable: pi-devbox already ships `setup-lan-access.sh`
   and the bridge is the supported path.
2. **Run the broker inside the container**, publishing a Docker port the host
   can later reach. Works, but inverts ownership and the broker dies with the
   container; only acceptable if containers are the *sole* writers on that
   host.
3. **Accept degraded mode** (algorithm step 4): a lone container with no peers
   has no concurrency, so direct access is safe *as long as* nothing else opens
   the palace concurrently. The host shim also checks `broker.json` before
   opening directly, so a later host pi will not silently start a second
   uncoordinated writer.

**Summary:** fully robust for native-only, native+container, and
container-only-with-bridge. The only residual sharp edge is container-only
*without* a bridge *and* a future concurrent host writer. That edge is
intrinsic (no shared coherent lock exists across that boundary) and is best
handled by mandating the bridge rather than pretending file locks work.

## Co-existence with opencode / opencode-devbox (DEFERRED, context only)

The palace is shared by more than pi. opencode (native) and opencode-devbox
(container) also write to the same `~/.mempalace`.

**Assumption to verify:** opencode sessions write to **different wings** than
pi sessions (pi uses `wing_pi`, diaries per-agent, etc.), so cross-tool
intermixing into the *same* destination may be a non-issue at the application
level.

However, the corruption risk here is at the **SQLite-file level, not the wing
level**: two processes writing different wings of the *same* `chroma.sqlite3`
concurrently is still a concurrent write to one file. So the broker, once it
exists, is the right serialization point for opencode too: opencode's mempalace
client would route through the same broker via the same shim mechanism.

**Decision:** do not design for opencode co-existence yet. Resolve the pi case
first; then revisit whether opencode clients adopt the same shim. The residual
risk in the interim is native + container *opencode* sessions writing the same
palace simultaneously, which is explicitly deferred ("cross that bridge later").

## Open questions / TODO before implementation

- Does the mempalace engine expose an embeddable entrypoint suitable for
  running inside a long-lived broker, or does the broker wrap the existing MCP
  server binary and multiplex stdio clients onto it? (Affects whether reads can
  truly fan out or are also serialized.)
- Idle-exit timeout default + whether to expose it via env.
- `broker.json` schema + atomic-write + stale-pid-reclaim details.
- TCP-path token handling and safe bind interface selection on Linux Docker
  (`--add-host=host.docker.internal:host-gateway`).
- Where the broker binary ships: baked into `Dockerfile.base`? host install via
  pi-toolkit / mempalace-toolkit? Both, since both sides need the shim and the
  host needs the broker.
- Smoke-test plan: prove single-writer invariant under a deliberate concurrent
  host+container write storm (should produce zero `.corrupt`/`.drift`
  snapshots).
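
As a starting point for the `broker.json` atomic-write and stale-pid question, one possible shape is sketched below. It is an illustration under stated assumptions, not a settled schema: the three fields (`endpoint`, `pid`, `token`) come from this document, while the function names, the flat JSON layout, and the write-temp-then-rename approach are hypothetical choices. Note that the pid probe is only meaningful on the kernel that owns the pid, so it belongs on the host side; a container would have to ask the host rather than probe locally.

```python
import json
import os
import tempfile


def write_broker_json(palace_dir: str, endpoint: str, token: str) -> str:
    """Publish broker.json atomically: write a temp file, fsync, then rename."""
    record = {"endpoint": endpoint, "pid": os.getpid(), "token": token}
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; mkstemp creates it with owner-only permissions.
    fd, tmp_path = tempfile.mkstemp(dir=palace_dir, prefix=".broker.json.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(record, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        final_path = os.path.join(palace_dir, "broker.json")
        os.replace(tmp_path, final_path)  # readers see old or new, never partial
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def pid_is_alive(pid: int) -> bool:
    """Probe a pid with signal 0 (no signal is actually delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def read_live_broker(palace_dir: str):
    """Return the broker record if its pid is alive, else None (stale or missing)."""
    try:
        with open(os.path.join(palace_dir, "broker.json"), encoding="utf-8") as f:
            record = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    return record if pid_is_alive(int(record["pid"])) else None
```

A client that gets `None` back would treat the recorded broker as gone and proceed to the election step of the discovery algorithm.
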