ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation
Jiseung Hong, Benjamin G. Ascoli, Jinho D. Choi

TL;DR
ReCUBE is a benchmark designed to evaluate how effectively large language models utilize repository-level context during code generation, highlighting current challenges and proposing tools to improve exploration.
Contribution
The paper introduces ReCUBE, a novel benchmark for measuring repository-level context utilization, and the CCE toolkit to enhance agent exploration in code generation tasks.
Findings
State-of-the-art models struggle with repository-level context: GPT-5 achieves a pass rate of only 37.57%.
The CCE toolkit improves exploration efficiency, increasing pass rates by up to 7.56%.
Repository context utilization remains a significant challenge for current LLMs.
Abstract
Large Language Models (LLMs) have recently emerged as capable coding assistants that operate over large codebases through either agentic exploration or full-context generation. Existing benchmarks capture a broad range of coding capabilities, such as resolving GitHub issues, but none of them directly isolate and measure how effectively LLMs leverage repository-level context during code generation. To address this, we introduce ReCUBE, a benchmark in which LLMs reconstruct a masked file within a real-world repository, using all remaining source files, dependency specifications, and documentation as their only source of context. ReCUBE evaluates reconstructed code with usage-aware test cases that simulate both internal module logic and external cross-file integration, reflecting real-world software usage patterns. We further propose the Caller-Centric Exploration (CCE) toolkit, a set of…
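The masked-file setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `MaskedFileTask`, `make_masked_file_task`, and `render_prompt` are hypothetical, the repository is modeled as an in-memory mapping from path to file contents, and the prompt layout is an assumption made only for illustration. The usage-aware test cases and the CCE toolkit are not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MaskedFileTask:
    """One hypothetical benchmark instance: a repository with one file hidden."""
    target_path: str         # path of the file the model must reconstruct
    context: dict[str, str]  # every other file, mapped as path -> contents
    reference: str           # original contents, held out for evaluation only


def make_masked_file_task(repo: dict[str, str], target_path: str) -> MaskedFileTask:
    """Hide one file; all remaining files become the only available context."""
    if target_path not in repo:
        raise KeyError(f"{target_path!r} is not in the repository")
    context = {path: text for path, text in repo.items() if path != target_path}
    return MaskedFileTask(target_path, context, repo[target_path])


def render_prompt(task: MaskedFileTask) -> str:
    """Concatenate the context files and name the file to reconstruct.

    The layout is an assumption; a real harness might instead expose the
    files through tools for agentic exploration.
    """
    parts = [f"### {path}\n{text}" for path, text in sorted(task.context.items())]
    parts.append(f"### Reconstruct the missing file: {task.target_path}")
    return "\n\n".join(parts)
```

Calling `make_masked_file_task(repo, "pkg/utils.py")` on a toy repository yields a context holding every file except `pkg/utils.py`, while the held-out `reference` stays out of the prompt so the reconstruction can later be checked against tests.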
