CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories
Divya Chukkapalli, Thejesh Avula, Aditya Aggarwal, Harsimran Singh, Amith Tallanki

TL;DR
CommitDistill is a lightweight, deterministic memory layer that extracts typed knowledge from a software repository's git history and surfaces it to improve retrieval for code-related queries.
Contribution
It introduces an open-source, dependency-free Python prototype that mines git history into typed knowledge units (Facts, Skills, Patterns) with deterministic regex and surfaces them through a TF-IDF retriever.
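To make the extraction step concrete, here is a minimal sketch of regex-based typing of commit-message lines using only the standard library. The three type names come from the paper; the regexes, the `KnowledgeUnit` class, and the `extract_units` function are hypothetical illustrations, since this summary does not show the prototype's actual rules.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns: the prototype's real regexes are not given in this summary.
# Dict order is insertion order, so matching is deterministic.
PATTERNS = {
    "Fact": re.compile(r"\b(?:because|due to|root cause|caused by)\b", re.IGNORECASE),
    "Skill": re.compile(r"\b(?:how to|workaround|steps to|to reproduce)\b", re.IGNORECASE),
    "Pattern": re.compile(r"\b(?:always|never|must|should not|convention)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class KnowledgeUnit:
    kind: str    # "Fact", "Skill", or "Pattern"
    commit: str  # commit identifier the line came from
    text: str    # the matching line of the commit message


def extract_units(commits):
    """Type each non-empty message line; commits is an iterable of (sha, message)."""
    units = []
    for sha, message in commits:
        for line in message.splitlines():
            line = line.strip()
            if not line:
                continue
            for kind, pattern in PATTERNS.items():
                if pattern.search(line):
                    units.append(KnowledgeUnit(kind, sha, line))
                    break  # first matching type wins
    return units
```

In practice the `(sha, message)` pairs could be read from `git log` through `subprocess`; they are passed in directly here so the sketch runs without a repository.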
Findings
Achieves 0.750 hit-rate on a 12-query benchmark with budget-constrained retrieval.
Outperforms BM25 and git log --grep in retrieval accuracy.
Extraction completes in under 4 seconds on a standard laptop.
Abstract
Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull-request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed-memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository's own git history under a constrained regime: deterministic, dependency-free, local-only, no embeddings. We present CommitDistill, an open-source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out-of-distribution queries.…
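The silence threshold can be illustrated with a small, dependency-free TF-IDF retriever that returns nothing when no unit scores high enough. This is a sketch under stated assumptions: only the value theta = 2.5 comes from the abstract, while the tokenizer, the smoothed IDF formula, the scoring function, and the `TfidfRetriever` class are illustrative choices, so the threshold is not necessarily calibrated for this scoring scale.

```python
import math
import re
from collections import Counter

THETA = 2.5  # value reported in the abstract; the scoring scale below is assumed


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class TfidfRetriever:
    def __init__(self, docs):
        self.docs = list(docs)
        self.term_counts = [Counter(tokenize(d)) for d in self.docs]
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for counts in self.term_counts for t in counts)
        # Smoothed IDF (an assumed variant, not necessarily the paper's).
        self.idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}

    def score(self, query, i):
        """Sum of term frequency times IDF over the distinct query terms."""
        counts = self.term_counts[i]
        return sum(counts[t] * self.idf.get(t, 0.0) for t in set(tokenize(query)))

    def retrieve(self, query, k=3, theta=THETA):
        """Return up to k (doc, score) pairs, or [] to abstain below theta."""
        ranked = sorted(
            ((self.score(query, i), i) for i in range(len(self.docs))),
            key=lambda pair: (-pair[0], pair[1]),  # ties broken by index
        )
        if not ranked or ranked[0][0] < theta:
            return []  # abstain: nothing clears the silence threshold
        return [(self.docs[i], s) for s, i in ranked[:k] if s >= theta]
```

A query whose terms never occur in the indexed units scores zero everywhere, so `retrieve` returns an empty list instead of a weak best guess, which is the abstention behavior the abstract describes for out-of-distribution queries.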
