mempalace-session: make --dry-run dedup-aware

A --dry-run report showed all qualifying sessions without indicating
which would actually hit the palace on a real run. On a second run
against an already-mined corpus this was misleading — output said
'Exported 62 session(s)' but the real mine step would skip all 62.

The wrapper now queries the palace's chroma.sqlite3 (read-only, via
file:...?mode=ro URI) for source_file values under the staging dir,
then tags each exported session as [NEW] or [SKIP] during listing and
reports the split in the summary:

  Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
    0 new   → will be filed on mine
    62 already filed → will be skipped (dedup by source_file)

  --dry-run: no new sessions to mine. A real run would skip all 62.
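
The read-only lookup described above could look roughly like the sketch below. It is illustrative, not the wrapper's actual code: the function name `filed_source_files` is hypothetical, and the `embedding_metadata` table with `key` / `string_value` columns is an assumption about the Chroma SQLite layout that should be checked against the installed version.

```python
import sqlite3
from pathlib import Path


def filed_source_files(palace_db: Path, staging_dir: Path) -> set[str]:
    """Return source_file values already recorded under staging_dir.

    Assumes (not confirmed by the commit) that Chroma keeps metadata in an
    `embedding_metadata` table with `key` and `string_value` columns.
    """
    # mode=ro makes SQLite refuse writes and refuse to create a missing file.
    uri = f"{palace_db.resolve().as_uri()}?mode=ro"
    prefix = str(staging_dir) + "/"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # substr() does a literal prefix match, so no LIKE escaping is needed.
        rows = conn.execute(
            "SELECT DISTINCT string_value FROM embedding_metadata "
            "WHERE key = 'source_file' AND substr(string_value, 1, ?) = ?",
            (len(prefix), prefix),
        ).fetchall()
    finally:
        conn.close()
    return {row[0] for row in rows}
```

Opening with `mode=ro` means a missing or unreadable database raises `sqlite3.OperationalError` instead of silently creating an empty file, which gives the caller a clear signal to fall back.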

Implementation notes:
- Classification is best-effort. If the palace is unreachable (fresh
  install, moved, permission-denied, file missing) the wrapper falls
  back to treating all exports as NEW — the real mine step still
  delegates dedup to 'mempalace mine --mode convos' which is the
  authoritative source of truth. Getting the classification wrong
  in --dry-run is cosmetic; behaviour of a real run is unchanged.
- Palace path respects $MEMPALACE_PATH env var for non-default setups.
- The same classification is also shown on a real (non-dry-run) mine,
  so users see upfront how much of the export set is actually new
  before the miner runs.
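
A minimal sketch of the best-effort classification in the notes above, under stated assumptions: `resolve_palace`, `classify`, and `summary` are hypothetical names, the default palace location is passed in by the caller because the commit does not state it, and `lookup` stands in for whatever function reads the palace.

```python
import os
import sqlite3
from pathlib import Path
from typing import Callable, Iterable


def resolve_palace(default: Path) -> Path:
    """Honor $MEMPALACE_PATH if set; otherwise use the caller's default."""
    override = os.environ.get("MEMPALACE_PATH")
    return Path(override) if override else default


def classify(
    exports: Iterable[str], lookup: Callable[[], set[str]]
) -> dict[str, str]:
    """Tag each exported file as NEW or SKIP.

    If the palace lookup fails for any reason, every export is tagged NEW.
    That is only cosmetic: the real mine step still does its own dedup.
    """
    try:
        filed = lookup()
    except (sqlite3.Error, OSError):
        filed = set()
    return {path: ("SKIP" if path in filed else "NEW") for path in exports}


def summary(tags: dict[str, str]) -> str:
    """Return the new/skipped split as a one-line string."""
    new = sum(1 for tag in tags.values() if tag == "NEW")
    return f"{new} new, {len(tags) - new} already filed"
```

Keeping the fallback inside `classify` means a broken or absent palace can never block an export; the worst case is a listing that over-reports NEW sessions.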

Verified both directions:
- All-already-filed case (current box, 62 sessions in palace): reports
  0 new, 62 skipped. --dry-run message correctly says 'would skip all'.
- Partial case (simulated by deleting one session's metadata from
  palace): reports 1 new, 61 skipped. --dry-run message correctly
  says 'would file 1 new'. Palace was restored from backup
  immediately after the test.

README and SKILL.md both updated with the new dedup-aware output and
a direct answer to the FAQ 'will it mine the same sessions again?'
Joakim Persson
2026-04-30 08:33:36 +00:00
parent 72e7019101
commit 349a3a3d3d
3 changed files with 81 additions and 6 deletions
@@ -276,6 +276,18 @@ mempalace-session --help
**Dedup:** staging at `~/.cache/mempalace-session/<wing>/` with deterministic per-session filenames (`<slug>_<id>.jsonl`). The convos miner keys on `source_file`, so re-runs skip unchanged sessions. To force re-mining a session, delete its JSONL from the staging dir.
**`--dry-run` is dedup-aware.** Each session is tagged `[NEW]` (would be filed) or `[SKIP]` (already in the palace), and the summary breaks down the count:
```
Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
  0 new   → will be filed on mine
  62 already filed → will be skipped (dedup by source_file)

--dry-run: no new sessions to mine. A real run would skip all 62.
```
If the palace is unreachable (fresh install, moved, permission-denied, file missing), the wrapper falls back to treating every export as new. This affects only the listing: the real mine step delegates dedup to `mempalace mine --mode convos`, which is always the source of truth. Running `mempalace-session` twice in a row is therefore never destructive, and the second run does no duplicate mining: its only cost is the post-mine HNSW `repair` step (~5 min on a ~5k-drawer palace).
**Filter:** sessions with fewer than `--min-messages` messages (default 3) are skipped — drops throwaway `/exit`'d sessions that would otherwise flood the palace. On a reference 140-session corpus, 78 were filtered this way.
**Cost profile:** ~20 minutes per 60-session batch. Scales roughly linearly with message count. Dedup re-run: mine step instant, only the post-mine `repair` runs (~5 min on 5k drawers).