mempalace-session: make --dry-run dedup-aware

Previously, a --dry-run report listed every qualifying session without
indicating which ones a real run would actually file in the palace. On a
second run against an already-mined corpus this was misleading: the output
said 'Exported 62 session(s)', but the real mine step would skip all 62.

The wrapper now queries the palace's chroma.sqlite3 (read-only, via a
file:...?mode=ro URI) for source_file values under the staging dir,
tags each exported session as [NEW] or [SKIP] in the listing, and
reports the split in the summary:

    Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
      0 new → will be filed on mine
      62 already filed → will be skipped (dedup by source_file)

    --dry-run: no new sessions to mine. A real run would skip all 62.

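For readers who want the tagging and the summary arithmetic in isolation, a minimal sketch follows. The helper names classify and summary_lines are hypothetical and do not appear in the wrapper; the sketch assumes the already-filed set holds absolute staging paths as strings, as the diff below does.

```python
from pathlib import Path


def classify(exports, already_filed):
    """Tag each exported path as SKIP (already in the palace) or NEW."""
    return [
        ("SKIP" if str(p) in already_filed else "NEW", Path(p).name)
        for p in exports
    ]


def summary_lines(tagged, stage):
    """Build the three summary lines printed after the per-session listing."""
    total = len(tagged)
    skipped = sum(1 for tag, _ in tagged if tag == "SKIP")
    new = total - skipped
    return [
        f"Exported {total} session(s) to {stage}",
        f"  {new} new → will be filed on mine",
        f"  {skipped} already filed → will be skipped (dedup by source_file)",
    ]


if __name__ == "__main__":
    stage = "/tmp/stage/wing_conversations"  # hypothetical staging dir
    exports = [f"{stage}/ses_a.jsonl", f"{stage}/ses_b.jsonl"]
    tagged = classify(exports, already_filed={exports[0]})
    for tag, name in tagged:
        print(f"  [{tag.ljust(4)}] {name}")
    print("\n".join(summary_lines(tagged, stage)))
```

The new count is derived as total minus skipped, mirroring the to_file computation in the shell part of the diff.
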
Implementation notes:

- Classification is best-effort. If the palace is unreachable (fresh
  install, moved, permission denied, file missing), the wrapper falls
  back to treating all exports as NEW. The real mine step still
  delegates dedup to 'mempalace mine --mode convos', which is the
  authoritative source of truth. A wrong classification in --dry-run
  is cosmetic; the behaviour of a real run is unchanged.
- The palace path respects the $MEMPALACE_PATH env var for non-default
  setups.
- The same classification is also shown on a real (non-dry-run) mine,
  so users see upfront how much of the export set is actually new
  before the miner runs.

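The best-effort fallback described in the notes above can be sketched as a standalone function. This is an illustration under stated assumptions, not the wrapper's exact code: load_already_filed and its palace_path parameter are hypothetical (the parameter exists only to make the sketch testable), while the table and column names, the default path, and the read-only URI form are taken from the diff below.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path

DEFAULT_PALACE = "~/.mempalace/palace"  # default path as used in the diff


def load_already_filed(stage, palace_path=None):
    """Return source_file values under `stage`; empty set if the palace is unusable."""
    if palace_path is None:
        palace_path = os.environ.get(
            "MEMPALACE_PATH", os.path.expanduser(DEFAULT_PALACE)
        )
    chroma_db = Path(palace_path) / "chroma.sqlite3"
    if not chroma_db.is_file():
        return set()  # no database file: every export counts as NEW
    try:
        # mode=ro opens the database read-only, so the check cannot modify it.
        with closing(sqlite3.connect(f"file:{chroma_db}?mode=ro", uri=True)) as con:
            rows = con.execute(
                "SELECT DISTINCT string_value FROM embedding_metadata "
                "WHERE key='source_file' AND string_value LIKE ?",
                (f"{stage}%",),
            )
            return {sf for (sf,) in rows if sf}
    except sqlite3.Error:
        return set()  # unreadable or unexpected schema: the miner still dedups


if __name__ == "__main__":
    filed = load_already_filed(os.path.expanduser("~/.cache/mempalace-session"))
    print(f"{len(filed)} source_file value(s) found")
```

Returning an empty set on failure keeps the caller free of special cases: an unreachable palace simply yields a listing in which everything is tagged NEW.
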
Verified both directions:

- All-already-filed case (current box, 62 sessions in palace): reports
  0 new, 62 skipped. The --dry-run message correctly says 'would skip
  all'.
- Partial case (simulated by deleting one session's metadata from the
  palace): reports 1 new, 61 skipped. The --dry-run message correctly
  says 'would file 1 new'. The palace was restored from backup
  immediately after the test.

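The partial case could also be reproduced against a throwaway database rather than the live palace. The sketch below assumes only the two columns (key, string_value) that the classification query reads; build_fixture and count_new_and_filed are hypothetical helpers, and the fixture schema is a stand-in, not Chroma's real schema.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_fixture(db_path, source_files):
    """Create a throwaway DB holding one source_file row per given path."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE embedding_metadata (key TEXT, string_value TEXT)")
    con.executemany(
        "INSERT INTO embedding_metadata VALUES ('source_file', ?)",
        [(sf,) for sf in source_files],
    )
    con.commit()
    con.close()


def count_new_and_filed(db_path, export_paths):
    """Return (new, already_filed) for export_paths against the fixture DB."""
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        filed = {
            row[0]
            for row in con.execute(
                "SELECT DISTINCT string_value FROM embedding_metadata "
                "WHERE key='source_file'"
            )
        }
    finally:
        con.close()
    already = sum(1 for p in export_paths if p in filed)
    return len(export_paths) - already, already


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        exports = [f"{tmp}/stage/ses_{i:02d}.jsonl" for i in range(62)]
        db = Path(tmp) / "chroma.sqlite3"
        build_fixture(db, exports[1:])  # leave one session unfiled
        print(count_new_and_filed(db, exports))
```
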
README and SKILL.md both updated with the new dedup-aware output and
a direct answer to the FAQ 'will it mine the same sessions again?'
+59
-6
@@ -66,10 +66,19 @@ Options:
   --agent <name>    Agent name recorded on drawers (default: $USER)
   --db <path>       Path to opencode.db (default: $OPENCODE_DB or
                     ~/.local/share/opencode/opencode.db)
-  --dry-run         Export + list; do not mine into palace
+  --dry-run         Export + list; do not mine into palace. Each session
+                    is tagged [NEW] or [SKIP] based on whether its
+                    source_file is already present in the palace.
   --no-repair       Skip `mempalace repair` after mining
   -h, --help        Show this help

+Idempotency:
+  Re-running on the same corpus is safe. The export step always writes every
+  qualifying session to the cache, but the mine step dedups on source_file
+  path — already-filed sessions are skipped without re-embedding. A --dry-run
+  summary shows exactly how many of the exported files are new vs already
+  filed, so you can see in advance what a real run would do.
+
 What gets mined:
   - Each qualifying session → one Claude Code JSONL file
   - Staged under ~/.cache/mempalace-session/<wing>/
@@ -139,6 +148,12 @@ mkdir -p "$STAGE"
 # ── Export sessions (Python heredoc) ────────────────────────────────
 # Writes one JSONL file per qualifying session into $STAGE.
 # Prints: EXPORTED <count> on stdout, plus per-session lines.
+#
+# If the palace is reachable, also classifies each export as NEW or ALREADY
+# FILED (matching by source_file path) so --dry-run can report the true
+# mine-set size, not just the export-set size. Classification is advisory
+# only — the real mine step delegates dedup to `mempalace mine --mode convos`,
+# which is the authoritative source of truth.
 export_count=$(python3 - "$OPENCODE_DB" "$STAGE" "$SESSION_ID" "$SINCE" "$MIN_MESSAGES" <<'PY'
 import sqlite3, json, sys, os
 from datetime import datetime, timezone
@@ -157,6 +172,28 @@ if since:
         print(f"error: --since must be YYYY-MM-DD, got {since!r}", file=sys.stderr)
         sys.exit(1)

+# ── Load palace's already-filed source_files (best-effort, read-only) ──
+# Key the dedup check on absolute staging path. The palace stores these in
+# chroma.sqlite3 under embedding_metadata.key='source_file'. If the palace
+# isn't reachable (first install, moved, permission-denied), we fall through
+# to "everything is new" — the mine step will do the real dedup anyway.
+already_filed = set()
+palace_path = os.environ.get("MEMPALACE_PATH", os.path.expanduser("~/.mempalace/palace"))
+chroma_db = Path(palace_path) / "chroma.sqlite3"
+if chroma_db.is_file():
+    try:
+        pcon = sqlite3.connect(f"file:{chroma_db}?mode=ro", uri=True)
+        for (sf,) in pcon.execute(
+            "SELECT DISTINCT string_value FROM embedding_metadata "
+            "WHERE key='source_file' AND string_value LIKE ?",
+            (f"{stage}%",),
+        ):
+            if sf:
+                already_filed.add(sf)
+        pcon.close()
+    except sqlite3.Error:
+        pass  # palace unreachable → treat all exports as new (miner will dedup)
+
 conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
 conn.row_factory = sqlite3.Row
 cur = conn.cursor()
@@ -181,6 +218,7 @@ if not sessions:
 # Prefetch messages + parts for qualifying sessions
 exported = 0
 skipped_short = 0
+skipped_already_filed = 0
 for sess in sessions:
     sid = sess["id"]
     cur.execute("SELECT COUNT(*) FROM message WHERE session_id=?", (sid,))
@@ -303,19 +341,26 @@ for sess in sessions:
         pass

     exported += 1
-    print(f" {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
+    is_filed = str(out_path) in already_filed
+    if is_filed:
+        skipped_already_filed += 1
+    status = "SKIP" if is_filed else "NEW "
+    print(f" [{status}] {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
           file=sys.stderr)

 print(f"EXPORTED {exported}")
+print(f"ALREADY_FILED {skipped_already_filed}")
 if skipped_short:
     print(f"SKIPPED_SHORT {skipped_short}", file=sys.stderr)
 PY
 )

-# Parse count from stdout
-count="${export_count##*EXPORTED }"
-count="${count%%[!0-9]*}"
+# Parse counts from stdout
+count="$(printf '%s\n' "$export_count" | awk '/^EXPORTED / { print $2 }')"
+count="${count:-0}"
+already_filed="$(printf '%s\n' "$export_count" | awk '/^ALREADY_FILED / { print $2 }')"
+already_filed="${already_filed:-0}"
+to_file=$(( count - already_filed ))

 if [[ "$count" -eq 0 ]]; then
   echo "no sessions qualified for export"
@@ -324,9 +369,17 @@ fi

 echo ""
 echo "Exported $count session(s) to $STAGE"
+echo " $to_file new → will be filed on mine"
+echo " $already_filed already filed → will be skipped (dedup by source_file)"

 if [[ $DRY_RUN -eq 1 ]]; then
-  echo "--dry-run: skipping mine step"
+  if [[ "$to_file" -eq 0 ]]; then
+    echo ""
+    echo "--dry-run: no new sessions to mine. A real run would skip all $count."
+  else
+    echo ""
+    echo "--dry-run: skipping mine step. A real run would file $to_file new session(s)."
+  fi
   exit 0
 fi