If you use Claude heavily, you've felt this: every session starts from zero. You re-explain context, Claude helps, the window closes, and the next session has no idea what you decided yesterday. The standard workaround is a markdown wiki Claude reads — but as the wiki grows, every "what did we decide about X" question burns thousands of tokens grepping and re-reading whole pages.

I spent the last few weeks building a persistent memory layer to fix both problems. It runs entirely on my own machine, integrates via MCP, and lives between Claude and my existing wiki. Sharing the architecture and what I learned in case anyone wants to build their own.

What it does Semantic retrieval over my wiki. Instead of Claude grepping pages, my MCP server returns the most relevant chunks for any query in ~50ms. 82% mean token reduction on a 10-query eval set vs the grep+Read baseline. F1 retrieval quality is also better — cheaper and more accurate.

Session crystallization. End-of-session, conversations get compressed into a structured "L4 node" with summary + decisions + open threads, indexed alongside wiki content. Tomorrow I can ask "what did we decide about X" and Claude pulls last session's decision verbatim.

Lazy-spawned local models. Embedder + chat model run as subprocesses that the supervisor spawns on first use and reaps after 1 hour idle. Boot cost is zero — nothing loaded until needed.

The architecture (four layers) Inspired by Andrej Karpathy's writing on LLM-native wikis, then formalized into a build spec:

L0 — append-only event log (SQLite). Every input/output, content-hashed.

L1 — structured facts with confidence + decay (deferred to next phase)

L2/L3 — derived prose + cross-cutting summaries (the hand-edited wiki plays this role for now)

L4 — crystallized session nodes. Summary, decisions, open threads. Indexed in the same vector store as wiki chunks so retrieval finds both naturally.

The stack Qdrant in Docker for vector search

llama.cpp running Qwen3-Embedding-4B (GPU) and Qwen3.5-2B-Q4_K_M (CPU)

FastMCP server exposing 7 tools ( retrieve , crystallize_session , list_sessions , get_l4_node , index_status , reindex , shutdown_models )

Cowork plugin for Claude Desktop integration; also works with Claude Code via standard MCP config

No cloud, no API keys, $0 marginal cost per query.

Numbers Token reduction: 82.7% mean, 86.2% median vs grep+Read baseline

Retrieval F1: 0.50 vs 0.20 baseline

Embed cold-start: ~4s. Hot-path p95: 39ms (was 2241ms before fixing one specific bug — see below)

L4 session retrieval eval: 0.920 mean score (gate 0.6)

738 chunks currently indexed across 104 markdown files

The most useful thing I learned Hot-path retrieve was inexplicably stuck at 2241ms p95 even though the embedding model was fully GPU-resident on a 4070 Ti Super. Spent hours blaming GPU offload, prompt cache, KV pre-allocation. The actual cause: every httpx.post() was opening a fresh TCP connection, and Windows localhost handshakes take ~2 seconds. A 5-line ch…

为什么值得关注

有直接可用的方法、工具或操作价值;符合当前抓取需求;它有实际可用价值,可以直接迁移到方法、工具或工作流

来源:reddit,领域:tech,保留分:0.61