mempalace-session: make --dry-run dedup-aware

Previously, a --dry-run report listed every qualifying session without
indicating which ones a real run would actually file in the palace. On
a second run against an already-mined corpus this was misleading: the
output said 'Exported 62 session(s)', but the real mine step would skip
all 62.

The wrapper now queries the palace's chroma.sqlite3 (read-only, via
file:...?mode=ro URI) for source_file values under the staging dir,
then tags each exported session as [NEW] or [SKIP] during listing and
reports the split in the summary:

  Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
    0 new   → will be filed on mine
    62 already filed → will be skipped (dedup by source_file)

  --dry-run: no new sessions to mine. A real run would skip all 62.
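
The lookup and tagging described above can be sketched roughly as
follows. This is a minimal illustration, not the wrapper's actual code:
the function names filed_sources and tag_exports are hypothetical, and
the schema (an embedding_metadata table with key and string_value
columns) is assumed from the query shown in the diff. The sketch builds
the URI with Path.as_uri() so that unusual characters in the path are
percent-encoded, which is a variation on the f-string used in the diff.

```python
import sqlite3
from pathlib import Path


def filed_sources(chroma_db: Path, stage: str) -> set[str]:
    """Return source_file values under `stage` already recorded in the palace."""
    # as_uri() percent-encodes the path; mode=ro makes SQLite refuse writes.
    uri = f"{chroma_db.resolve().as_uri()}?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT DISTINCT string_value FROM embedding_metadata "
            "WHERE key = 'source_file' AND string_value LIKE ?",
            (f"{stage}%",),
        )
        return {value for (value,) in rows if value}
    finally:
        con.close()


def tag_exports(paths: list[str], filed: set[str]) -> list[str]:
    """Prefix each exported file name with [SKIP] if already filed, else [NEW ]."""
    return [f"[{'SKIP' if p in filed else 'NEW '}] {Path(p).name}" for p in paths]
```

The LIKE prefix only narrows the candidate set; the tag itself depends
on an exact string match between the staged path and the stored
source_file value.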

Implementation notes:
- Classification is best-effort. If the palace is unreachable (fresh
  install, moved, permission-denied, file missing), the wrapper falls
  back to treating all exports as NEW. The real mine step still
  delegates dedup to 'mempalace mine --mode convos', which is the
  authoritative source of truth, so a wrong classification in
  --dry-run is cosmetic; the behaviour of a real run is unchanged.
- The palace path is read from the $MEMPALACE_PATH env var when set
  (default: ~/.mempalace/palace), so non-default setups work.
- The same classification is also shown on a real (non-dry-run) mine,
  so users see up front how much of the export set is actually new
  before the miner runs.
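
The best-effort fallback in the notes above might look like the sketch
below. The names palace_db and load_filed_best_effort are hypothetical,
and the boolean "readable" flag is an addition for illustration only:
the diff simply falls through to an empty set. The default path and the
table layout are taken from the diff.

```python
import os
import sqlite3
from pathlib import Path

DEFAULT_PALACE = "~/.mempalace/palace"  # default used in the diff


def palace_db(env=None) -> Path:
    """Resolve chroma.sqlite3, honouring MEMPALACE_PATH when it is set."""
    env = os.environ if env is None else env
    root = env.get("MEMPALACE_PATH", os.path.expanduser(DEFAULT_PALACE))
    return Path(root) / "chroma.sqlite3"


def load_filed_best_effort(chroma_db: Path) -> tuple[set[str], bool]:
    """Return (filed source_files, palace_was_readable).

    Any failure yields an empty set, so every export is reported as NEW
    and the real dedup is left to the miner.
    """
    if not chroma_db.is_file():
        return set(), False
    try:
        con = sqlite3.connect(f"{chroma_db.resolve().as_uri()}?mode=ro", uri=True)
        try:
            rows = con.execute(
                "SELECT DISTINCT string_value FROM embedding_metadata "
                "WHERE key = 'source_file'"
            ).fetchall()
        finally:
            con.close()
    except sqlite3.Error:
        # Covers a corrupt file, a missing table, or a permission problem.
        return set(), False
    return {value for (value,) in rows if value}, True
```

Returning the flag would let a caller distinguish "nothing filed yet"
from "could not check", which the summary could mention if desired.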

Verified both directions:
- All-already-filed case (current box, 62 sessions in palace): reports
  0 new, 62 skipped. --dry-run message correctly says 'would skip all'.
- Partial case (simulated by deleting one session's metadata from
  palace): reports 1 new, 61 skipped. --dry-run message correctly
  says 'would file 1 new'. Palace was restored from backup
  immediately after the test.
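
Both cases can also be exercised without touching a live palace. The
sketch below is one hypothetical way to do that: summarize is an
illustrative name, the staged paths are made up, and the messages are
copied from the wrapper's output in the diff. The partial case is
simulated by dropping one path from an in-memory set rather than
deleting metadata from the palace.

```python
def summarize(exported, filed):
    """Return (new, skipped, dry_run_message) for a list of exported paths."""
    skipped = sum(1 for path in exported if path in filed)
    new = len(exported) - skipped
    if new == 0:
        msg = (
            "--dry-run: no new sessions to mine. "
            f"A real run would skip all {len(exported)}."
        )
    else:
        msg = (
            "--dry-run: skipping mine step. "
            f"A real run would file {new} new session(s)."
        )
    return new, skipped, msg


# 62 hypothetical staged files, all already filed.
stage = "/tmp/stage/wing_conversations"
exported = [f"{stage}/session_{i:02d}.jsonl" for i in range(62)]
filed = set(exported)
print(summarize(exported, filed))

# Partial case: pretend one session was never filed.
filed.discard(exported[0])
print(summarize(exported, filed))
```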

README and SKILL.md are both updated with the new dedup-aware output
and a direct answer to the FAQ 'will it mine the same sessions again?'
Joakim Persson
2026-04-30 08:33:36 +00:00
parent 72e7019101
commit 349a3a3d3d
3 changed files with 81 additions and 6 deletions
+59 -6
@@ -66,10 +66,19 @@ Options:
--agent <name> Agent name recorded on drawers (default: $USER)
--db <path> Path to opencode.db (default: $OPENCODE_DB or
~/.local/share/opencode/opencode.db)
--dry-run Export + list; do not mine into palace
--dry-run Export + list; do not mine into palace. Each session
is tagged [NEW] or [SKIP] based on whether its
source_file is already present in the palace.
--no-repair Skip `mempalace repair` after mining
-h, --help Show this help
Idempotency:
Re-running on the same corpus is safe. The export step always writes every
qualifying session to the cache, but the mine step dedups on source_file
path — already-filed sessions are skipped without re-embedding. A --dry-run
summary shows exactly how many of the exported files are new vs already
filed, so you can see in advance what a real run would do.
What gets mined:
- Each qualifying session → one Claude Code JSONL file
- Staged under ~/.cache/mempalace-session/<wing>/
@@ -139,6 +148,12 @@ mkdir -p "$STAGE"
# ── Export sessions (Python heredoc) ────────────────────────────────
# Writes one JSONL file per qualifying session into $STAGE.
# Prints: EXPORTED <count> on stdout, plus per-session lines.
#
# If the palace is reachable, also classifies each export as NEW or ALREADY
# FILED (matching by source_file path) so --dry-run can report the true
# mine-set size, not just the export-set size. Classification is advisory
# only — the real mine step delegates dedup to `mempalace mine --mode convos`,
# which is the authoritative source of truth.
export_count=$(python3 - "$OPENCODE_DB" "$STAGE" "$SESSION_ID" "$SINCE" "$MIN_MESSAGES" <<'PY'
import sqlite3, json, sys, os
from datetime import datetime, timezone
@@ -157,6 +172,28 @@ if since:
print(f"error: --since must be YYYY-MM-DD, got {since!r}", file=sys.stderr)
sys.exit(1)
# ── Load palace's already-filed source_files (best-effort, read-only) ──
# Key the dedup check on absolute staging path. The palace stores these in
# chroma.sqlite3 under embedding_metadata.key='source_file'. If the palace
# isn't reachable (first install, moved, permission-denied), we fall through
# to "everything is new" — the mine step will do the real dedup anyway.
already_filed = set()
palace_path = os.environ.get("MEMPALACE_PATH", os.path.expanduser("~/.mempalace/palace"))
chroma_db = Path(palace_path) / "chroma.sqlite3"
if chroma_db.is_file():
    try:
        pcon = sqlite3.connect(f"file:{chroma_db}?mode=ro", uri=True)
        for (sf,) in pcon.execute(
            "SELECT DISTINCT string_value FROM embedding_metadata "
            "WHERE key='source_file' AND string_value LIKE ?",
            (f"{stage}%",),
        ):
            if sf:
                already_filed.add(sf)
        pcon.close()
    except sqlite3.Error:
        pass  # palace unreachable → treat all exports as new (miner will dedup)
conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
conn.row_factory = sqlite3.Row
cur = conn.cursor()
@@ -181,6 +218,7 @@ if not sessions:
# Prefetch messages + parts for qualifying sessions
exported = 0
skipped_short = 0
skipped_already_filed = 0
for sess in sessions:
sid = sess["id"]
cur.execute("SELECT COUNT(*) FROM message WHERE session_id=?", (sid,))
@@ -303,19 +341,26 @@ for sess in sessions:
pass
exported += 1
print(f" {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
is_filed = str(out_path) in already_filed
if is_filed:
skipped_already_filed += 1
status = "SKIP" if is_filed else "NEW "
print(f" [{status}] {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
file=sys.stderr)
print(f"EXPORTED {exported}")
print(f"ALREADY_FILED {skipped_already_filed}")
if skipped_short:
print(f"SKIPPED_SHORT {skipped_short}", file=sys.stderr)
PY
)
# Parse count from stdout
count="${export_count##*EXPORTED }"
count="${count%%[!0-9]*}"
# Parse counts from stdout
count="$(printf '%s\n' "$export_count" | awk '/^EXPORTED / { print $2 }')"
count="${count:-0}"
already_filed="$(printf '%s\n' "$export_count" | awk '/^ALREADY_FILED / { print $2 }')"
already_filed="${already_filed:-0}"
to_file=$(( count - already_filed ))
if [[ "$count" -eq 0 ]]; then
echo "no sessions qualified for export"
@@ -324,9 +369,17 @@ fi
echo ""
echo "Exported $count session(s) to $STAGE"
echo " $to_file new → will be filed on mine"
echo " $already_filed already filed → will be skipped (dedup by source_file)"
if [[ $DRY_RUN -eq 1 ]]; then
echo "--dry-run: skipping mine step"
if [[ "$to_file" -eq 0 ]]; then
echo ""
echo "--dry-run: no new sessions to mine. A real run would skip all $count."
else
echo ""
echo "--dry-run: skipping mine step. A real run would file $to_file new session(s)."
fi
exit 0
fi
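
For reference, the marker-line protocol the shell side parses with awk
(EXPORTED <n> and ALREADY_FILED <n> on stdout) can be mirrored in
Python as in the sketch below. The function name parse_counts and the
TO_FILE key are hypothetical; the zero defaults correspond to the
${count:-0} and ${already_filed:-0} fallbacks in the script.

```python
def parse_counts(stdout: str) -> dict[str, int]:
    """Extract EXPORTED / ALREADY_FILED counts; missing markers default to 0."""
    counts = {"EXPORTED": 0, "ALREADY_FILED": 0}
    for line in stdout.splitlines():
        # Only lines that start with the marker count, like awk's /^EXPORTED /.
        key, _, value = line.partition(" ")
        value = value.strip()
        if key in counts and value.isdecimal():
            counts[key] = int(value)
    # Mirrors to_file=$(( count - already_filed )) in the wrapper.
    counts["TO_FILE"] = counts["EXPORTED"] - counts["ALREADY_FILED"]
    return counts
```

Matching whole marker lines, rather than stripping a prefix from the
entire captured output, keeps the parse stable when more than one
marker is printed.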