Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
Shweta Mishra

TL;DR
This paper introduces a fast, correctness-aware filtering framework for large repositories that significantly reduces context size while maintaining accuracy, enabling more efficient LLM-based developer tools.
Contribution
It proposes a size-based heuristic filter that operates at the OS level, using file metadata rather than an index, and reports high token reduction and file-level accuracy in repository filtering.
Findings
79.6% token reduction at 0.30 ms overhead with SizeFilter
89.3% token reduction with HybridFilter, lowest variance
72% file-level accuracy in filtering, reducing hallucinations from 61% to 17%
Abstract
Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits, the Maximum Effective Context Window (MECW), which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts (compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files) that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index…
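To make the idea of a stat()-only size filter concrete, here is a minimal Python sketch. It is not the paper's SizeFilter implementation: the function name size_filter, the MAX_BYTES cutoff, and the kept/skipped return shape are all assumptions chosen for illustration, since the abstract does not state the threshold or interface the author uses. The sketch only shows that a scan can decide which files to exclude from file metadata alone, without opening or tokenizing any file.

```python
import os

# Hypothetical cutoff: the source does not state the threshold it uses.
MAX_BYTES = 1_000_000


def size_filter(root, max_bytes=MAX_BYTES):
    """Split files under root into (kept, skipped) using only stat() metadata.

    No file is opened or read; the decision is based on st_size alone.
    """
    kept, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.stat(path).st_size
            except OSError:
                # Unreadable entries (e.g. broken symlinks) are ignored.
                continue
            if size <= max_bytes:
                kept.append(path)
            else:
                skipped.append(path)
    return sorted(kept), sorted(skipped)
```

A caller would pass only the kept paths on to tokenization, so oversized artifacts such as model weights or large logs never reach the context window.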
