Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints

Shweta Mishra

arXiv:2605.14362·cs.SE·May 15, 2026

TL;DR

This paper introduces a fast, correctness-aware filtering framework for large repositories that significantly reduces context size while maintaining accuracy, enabling more efficient LLM-based developer tools.

Contribution

It proposes a size-based heuristic filter that operates at the OS level without indexing, achieving high token reduction and accuracy in repository filtering.

Findings

01. 79.6% token reduction at 0.30 ms overhead with SizeFilter

02. 89.3% token reduction with HybridFilter, lowest variance

03. 72% file-level accuracy in filtering, reducing hallucinations from 61% to 17%

Abstract

Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits (the Maximum Effective Context Window, MECW), which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts (compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files) that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index…
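The size-based filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's SizeFilter implementation: the function name `size_filter`, the constant `MAX_BYTES`, and the 1 MB threshold are assumptions chosen for illustration. The sketch only shows how a repository walk can decide which files to keep from stat() metadata alone, without opening or tokenizing any file.

```python
import os
from typing import Iterator

# Assumed threshold for illustration only; the paper's actual cutoff is not stated here.
MAX_BYTES = 1_000_000


def size_filter(root: str, max_bytes: int = MAX_BYTES) -> Iterator[str]:
    """Yield paths of regular files under `root` whose size is at most `max_bytes`.

    Only directory entries and stat() metadata are consulted; file contents
    are never read.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Descend into subdirectories, skipping symlinked ones.
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # st_size comes from the entry's stat() result.
                    if entry.stat(follow_symlinks=False).st_size <= max_bytes:
                        yield entry.path
```

Calling `list(size_filter("path/to/repo"))` would return the files small enough to pass on to tokenization under the assumed threshold. A real tool would likely combine this with other signals, since file size alone cannot tell a large but relevant source file from a large artifact.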
