mempalace-session: make --dry-run dedup-aware

Previously, the --dry-run report listed every qualifying session without
indicating which ones a real run would actually file into the palace. On
a second run against an already-mined corpus this was misleading: the
output said 'Exported 62 session(s)', but the real mine step would skip
all 62.
The wrapper now queries the palace's chroma.sqlite3 (read-only, via
file:...?mode=ro URI) for source_file values under the staging dir,
then tags each exported session as [NEW] or [SKIP] during listing and
reports the split in the summary:

    Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
    0 new → will be filed on mine
    62 already filed → will be skipped (dedup by source_file)

    --dry-run: no new sessions to mine. A real run would skip all 62.

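The tagging and the summary arithmetic described above can be sketched in Python. This is an illustrative sketch, not the wrapper's actual code: `classify` and `summary_lines` are hypothetical names, and the inputs are assumed to be the exported staging paths plus the set of `source_file` values already found in the palace.

```python
from pathlib import Path


def classify(exported, already_filed):
    """Tag each exported path NEW or SKIP by exact source_file match."""
    return [
        ("SKIP" if str(p) in already_filed else "NEW", Path(p).name)
        for p in exported
    ]


def summary_lines(tags, stage, dry_run=True):
    """Build the summary printed after the per-session listing."""
    total = len(tags)
    skipped = sum(1 for status, _ in tags if status == "SKIP")
    new = total - skipped
    lines = [
        f"Exported {total} session(s) to {stage}",
        f" {new} new → will be filed on mine",
        f" {skipped} already filed → will be skipped (dedup by source_file)",
    ]
    if dry_run:
        if new == 0:
            lines.append(
                f"--dry-run: no new sessions to mine. A real run would skip all {total}."
            )
        else:
            lines.append(
                f"--dry-run: skipping mine step. A real run would file {new} new session(s)."
            )
    return lines
```

Because the match is an exact string comparison on the staging path, a session only counts as SKIP when the palace recorded the same absolute path.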
Implementation notes:
- Classification is best-effort. If the palace is unreachable (fresh
  install, moved, permission-denied, file missing), the wrapper falls
  back to treating all exports as NEW. The real mine step still
  delegates dedup to 'mempalace mine --mode convos', which is the
  authoritative source of truth. Getting the classification wrong
  in --dry-run is cosmetic; the behaviour of a real run is unchanged.
- The palace path respects the $MEMPALACE_PATH env var for non-default
  setups.
- The same classification is also shown on a real (non-dry-run) mine,
  so users see upfront how much of the export set is actually new
  before the miner runs.
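The best-effort lookup in these notes can be sketched as a standalone Python function. This is a sketch under stated assumptions, not the committed code: `load_filed_source_files` is a hypothetical name, and the `embedding_metadata` table with `key` and `string_value` columns is taken from the diff in this commit. Any failure yields an empty set, which makes every export classify as NEW.

```python
import sqlite3
from pathlib import Path


def load_filed_source_files(palace_path, stage):
    """Return source_file values under `stage`; empty set on any failure."""
    try:
        chroma_db = Path(palace_path) / "chroma.sqlite3"
        if not chroma_db.is_file():
            return set()  # fresh install or moved palace
        # mode=ro means the check can never write to the palace.
        uri = f"{chroma_db.resolve().as_uri()}?mode=ro"
        con = sqlite3.connect(uri, uri=True)
        try:
            rows = con.execute(
                "SELECT DISTINCT string_value FROM embedding_metadata "
                "WHERE key = 'source_file' AND string_value LIKE ?",
                (f"{stage}%",),
            ).fetchall()
        finally:
            con.close()
    except (sqlite3.Error, OSError):
        # Unreadable file, unexpected schema, or permission problem:
        # fall back to "everything is new"; the miner still dedups.
        return set()
    return {sf for (sf,) in rows if sf}
```

A caller would pass the value of `$MEMPALACE_PATH` (or the default palace directory) and the staging directory. The `LIKE` prefix only narrows the candidate set; the NEW/SKIP decision itself is an exact match on the full path.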
Verified both directions:
- All-already-filed case (current box, 62 sessions in palace): reports
  0 new, 62 skipped. The --dry-run message correctly says 'would skip
  all'.
- Partial case (simulated by deleting one session's metadata from the
  palace): reports 1 new, 61 skipped. The --dry-run message correctly
  says 'would file 1 new'. The palace was restored from backup
  immediately after the test.
README and SKILL.md both updated with the new dedup-aware output and
a direct answer to the FAQ 'will it mine the same sessions again?'
@@ -276,6 +276,18 @@ mempalace-session --help
 
 **Dedup:** staging at `~/.cache/mempalace-session/<wing>/` with deterministic per-session filenames (`<slug>_<id>.jsonl`). The convos miner keys on `source_file`, so re-runs skip unchanged sessions. To force re-mining a session, delete its JSONL from the staging dir.
 
+**`--dry-run` is dedup-aware.** Each session is tagged `[NEW]` (would be filed) or `[SKIP]` (already in the palace), and the summary breaks down the count:
+
+```
+Exported 62 session(s) to ~/.cache/mempalace-session/wing_conversations
+0 new → will be filed on mine
+62 already filed → will be skipped (dedup by source_file)
+
+--dry-run: no new sessions to mine. A real run would skip all 62.
+```
+
+If the palace is unreachable (fresh install, moved, permission-denied) the wrapper falls back to "everything is new" — the real mine step delegates dedup to `mempalace mine --mode convos`, which is always the source of truth. So running `mempalace-session` twice in a row is never destructive or wasteful: the second run's only cost is the post-mine HNSW `repair` step (~5 min on a ~5k-drawer palace).
+
 **Filter:** sessions with fewer than `--min-messages` messages (default 3) are skipped — drops throwaway `/exit`'d sessions that would otherwise flood the palace. On a reference 140-session corpus, 78 were filtered this way.
 
 **Cost profile:** ~20 minutes per 60-session batch. Scales roughly linearly with message count. Dedup re-run: mine step instant, only the post-mine `repair` runs (~5 min on 5k drawers).
@@ -90,6 +90,16 @@ A docs-heavy repo should produce ~5–10 drawers per file. >15 drawers/file on a
 
 Second run immediately after first → 0 new drawers, only the post-mine `repair` step runs (~5 min on 5k drawers).
 
+**`mempalace-session --dry-run` is dedup-aware.** Each session listed is tagged `[NEW]` (would be filed) or `[SKIP]` (already in the palace), and the summary reports the split:
+
+```
+Exported 62 session(s) to ~/.cache/...
+0 new → will be filed on mine
+62 already filed → will be skipped (dedup by source_file)
+```
+
+So when a user asks "will it mine the same sessions again?" — point them at `mempalace-session --dry-run` and read the summary line. If `N new = 0`, nothing will be re-filed. The classification check is best-effort (falls back to "everything is new" if palace unreachable); the real mine step delegates to `mempalace mine --mode convos`, which is always the authoritative dedup source.
+
 ### Incremental catch-up
 
 ```bash
+59
-6
@@ -66,10 +66,19 @@ Options:
   --agent <name>      Agent name recorded on drawers (default: $USER)
   --db <path>         Path to opencode.db (default: $OPENCODE_DB or
                       ~/.local/share/opencode/opencode.db)
-  --dry-run           Export + list; do not mine into palace
+  --dry-run           Export + list; do not mine into palace. Each session
+                      is tagged [NEW] or [SKIP] based on whether its
+                      source_file is already present in the palace.
   --no-repair         Skip `mempalace repair` after mining
   -h, --help          Show this help
 
+Idempotency:
+  Re-running on the same corpus is safe. The export step always writes every
+  qualifying session to the cache, but the mine step dedups on source_file
+  path — already-filed sessions are skipped without re-embedding. A --dry-run
+  summary shows exactly how many of the exported files are new vs already
+  filed, so you can see in advance what a real run would do.
+
 What gets mined:
   - Each qualifying session → one Claude Code JSONL file
   - Staged under ~/.cache/mempalace-session/<wing>/
@@ -139,6 +148,12 @@ mkdir -p "$STAGE"
 # ── Export sessions (Python heredoc) ────────────────────────────────
 # Writes one JSONL file per qualifying session into $STAGE.
 # Prints: EXPORTED <count> on stdout, plus per-session lines.
+#
+# If the palace is reachable, also classifies each export as NEW or ALREADY
+# FILED (matching by source_file path) so --dry-run can report the true
+# mine-set size, not just the export-set size. Classification is advisory
+# only — the real mine step delegates dedup to `mempalace mine --mode convos`,
+# which is the authoritative source of truth.
 export_count=$(python3 - "$OPENCODE_DB" "$STAGE" "$SESSION_ID" "$SINCE" "$MIN_MESSAGES" <<'PY'
 import sqlite3, json, sys, os
 from datetime import datetime, timezone
@@ -157,6 +172,28 @@ if since:
         print(f"error: --since must be YYYY-MM-DD, got {since!r}", file=sys.stderr)
         sys.exit(1)
 
+# ── Load palace's already-filed source_files (best-effort, read-only) ──
+# Key the dedup check on absolute staging path. The palace stores these in
+# chroma.sqlite3 under embedding_metadata.key='source_file'. If the palace
+# isn't reachable (first install, moved, permission-denied), we fall through
+# to "everything is new" — the mine step will do the real dedup anyway.
+already_filed = set()
+palace_path = os.environ.get("MEMPALACE_PATH", os.path.expanduser("~/.mempalace/palace"))
+chroma_db = Path(palace_path) / "chroma.sqlite3"
+if chroma_db.is_file():
+    try:
+        pcon = sqlite3.connect(f"file:{chroma_db}?mode=ro", uri=True)
+        for (sf,) in pcon.execute(
+            "SELECT DISTINCT string_value FROM embedding_metadata "
+            "WHERE key='source_file' AND string_value LIKE ?",
+            (f"{stage}%",),
+        ):
+            if sf:
+                already_filed.add(sf)
+        pcon.close()
+    except sqlite3.Error:
+        pass  # palace unreachable → treat all exports as new (miner will dedup)
+
 conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
 conn.row_factory = sqlite3.Row
 cur = conn.cursor()
@@ -181,6 +218,7 @@ if not sessions:
 # Prefetch messages + parts for qualifying sessions
 exported = 0
 skipped_short = 0
+skipped_already_filed = 0
 for sess in sessions:
     sid = sess["id"]
     cur.execute("SELECT COUNT(*) FROM message WHERE session_id=?", (sid,))
@@ -303,19 +341,26 @@ for sess in sessions:
         pass
 
     exported += 1
-    print(f" {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
+    is_filed = str(out_path) in already_filed
+    if is_filed:
+        skipped_already_filed += 1
+    status = "SKIP" if is_filed else "NEW "
+    print(f" [{status}] {out_path.name} ({msg_count} msgs, {len(out_lines)} turns)",
           file=sys.stderr)
 
 print(f"EXPORTED {exported}")
+print(f"ALREADY_FILED {skipped_already_filed}")
 if skipped_short:
     print(f"SKIPPED_SHORT {skipped_short}", file=sys.stderr)
 PY
 )
 
-# Parse count from stdout
-count="${export_count##*EXPORTED }"
-count="${count%%[!0-9]*}"
+# Parse counts from stdout
+count="$(printf '%s\n' "$export_count" | awk '/^EXPORTED / { print $2 }')"
 count="${count:-0}"
+already_filed="$(printf '%s\n' "$export_count" | awk '/^ALREADY_FILED / { print $2 }')"
+already_filed="${already_filed:-0}"
+to_file=$(( count - already_filed ))
 
 if [[ "$count" -eq 0 ]]; then
   echo "no sessions qualified for export"
@@ -324,9 +369,17 @@ fi
 
 echo ""
 echo "Exported $count session(s) to $STAGE"
+echo " $to_file new → will be filed on mine"
+echo " $already_filed already filed → will be skipped (dedup by source_file)"
 
 if [[ $DRY_RUN -eq 1 ]]; then
-  echo "--dry-run: skipping mine step"
+  if [[ "$to_file" -eq 0 ]]; then
+    echo ""
+    echo "--dry-run: no new sessions to mine. A real run would skip all $count."
+  else
+    echo ""
+    echo "--dry-run: skipping mine step. A real run would file $to_file new session(s)."
+  fi
   exit 0
 fi
 