LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto

TL;DR
This paper introduces LongCodeBench, a benchmark that evaluates the long-context capabilities of coding language models on real-world code comprehension and repair tasks. It finds that current models struggle as the context grows.
Contribution
The paper presents LongCodeBench, a realistic long-context coding benchmark built from real-world GitHub issues, with a QA task (LongCodeQA) and a bug-fixing task (LongSWE-Bench). It evaluates multiple models and highlights their weaknesses at extended context lengths.
Findings
Long-context performance remains a challenge for all tested models.
Performance drops significantly as context length increases.
The benchmark is publicly available for future research.
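One way to examine a drop in performance with context length is to group per-example outcomes by context size and compute accuracy within each group. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `accuracy_by_context_bucket`, the record format, the bucket edges, and the toy records are all hypothetical placeholders and do not reproduce any result from the paper.

```python
from collections import defaultdict


def accuracy_by_context_bucket(records, bucket_edges):
    """Return accuracy per context-length bucket.

    records: iterable of (context_tokens, correct) pairs (hypothetical format).
    bucket_edges: ascending upper bounds, in tokens, for each bucket.
    Records longer than the last edge are ignored.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for context_tokens, correct in records:
        # Assign the record to the first bucket whose upper edge covers it.
        for edge in bucket_edges:
            if context_tokens <= edge:
                totals[edge] += 1
                hits[edge] += int(correct)
                break
    # Skip empty buckets to avoid dividing by zero.
    return {edge: hits[edge] / totals[edge] for edge in bucket_edges if totals[edge]}


# Synthetic placeholder records for illustration only; NOT results from the paper.
toy_records = [
    (20_000, True), (30_000, True),
    (100_000, True), (120_000, False),
    (900_000, False), (1_000_000, False),
]
toy_edges = [32_000, 128_000, 1_000_000]  # assumed bucket boundaries
toy_scores = accuracy_by_context_bucket(toy_records, toy_edges)
print(toy_scores)
```

Reporting accuracy per bucket rather than a single aggregate keeps short-context examples from masking failures at the longest contexts.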
Abstract
Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across…
