Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley

TL;DR
Oolong is a benchmark designed to evaluate long-context reasoning and aggregation capabilities in models, highlighting current limitations even in state-of-the-art models with extensive context lengths.
Contribution
The paper introduces Oolong, a new benchmark for long-context reasoning tasks that require analyzing and aggregating information from large text chunks, including synthetic and real-world data.
Findings
Frontier models perform poorly on Oolong, with less than 50% accuracy at 128K context length.
Oolong challenges models to perform classification, counting, and reasoning over temporal and user relations.
The authors release data and tools to foster development of better long-context reasoning models.
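To make the "analyze each item, then aggregate" structure concrete, here is a minimal toy sketch of that kind of task. It is not Oolong's data format, labeling scheme, or evaluation code: the `Item` fields, the keyword-based `classify` rule, and the example messages are all hypothetical stand-ins. The point is only that answering a distributional question requires a decision on every item, so no part of the context can be skipped as noise.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout, chosen for illustration only.
    user: str
    date: str  # ISO date string, e.g. "2024-01-03"
    text: str


def classify(text: str) -> str:
    """Toy per-item labeler (keyword rule); a stand-in for atomic analysis."""
    spam_markers = ("free", "winner", "click")
    lowered = text.lower()
    return "spam" if any(marker in lowered for marker in spam_markers) else "ham"


def aggregate(items):
    """Label every item, then pool the labels into global and per-user counts."""
    label_counts = Counter()
    spam_by_user = Counter()
    for item in items:
        label = classify(item.text)
        label_counts[label] += 1
        if label == "spam":
            spam_by_user[item.user] += 1
    return label_counts, spam_by_user


items = [
    Item("alice", "2024-01-03", "Lunch at noon?"),
    Item("bob", "2024-01-04", "You are a WINNER, click here"),
    Item("bob", "2024-01-09", "Free tickets inside"),
    Item("carol", "2024-02-01", "Meeting notes attached"),
    Item("alice", "2024-02-02", "Click to claim your free prize"),
]

label_counts, spam_by_user = aggregate(items)

# Distributional questions: each one depends on every item in the context.
most_common_label = label_counts.most_common(1)[0][0]
top_spam_sender = spam_by_user.most_common(1)[0][0]
spam_in_january = sum(
    1
    for item in items
    if item.date.startswith("2024-01") and classify(item.text) == "spam"
)
```

In a retrieval-style task, locating one or two relevant items is enough. Here, dropping any single item can change the counts and therefore the answer.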
Abstract
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world…
Peer Reviews
Decision·Submitted to ICLR 2026
* Practical Significance: The proposed benchmark tries to move beyond common needle-in-a-haystack or generic summarization setups, targeting fine-grained analysis and aggregation. These areas are underrepresented in existing long-context benchmarks.
* Originality: Requires atomic reasoning, counting/classification, and temporal/user-relational aggregation rather than simple retrieval, reducing shortcut solutions.
* Realistic long-horizon data: Uses Dungeons & Dragons campaign transcripts, which feature
* Limited experimental validation of claims: The paper argues Oolong differs from retrieval-centric long-context benchmarks by requiring multi-chunk analysis and logical aggregation, but provides no targeted experiments to substantiate this. To demonstrate the distinction empirically, the evaluation should include not only base LLMs but also strong single- and multi-step RAG baselines. Divergent performance of such RAG systems on Oolong vs. needle-in-a-haystack / multi-hop retrieval tasks would
- Oolong provides two challenging sets of tasks that underscore the limitations of LLMs in long-context scenarios.
- The sample size is scalable, while context usage remains denser than in classic needle-in-a-haystack-based benchmarks. Relevant facts are hard to distinguish from irrelevant ones, which makes the tasks challenging for LLMs.
- Following relevant works, Oolong-synth inherits the validation-test structure, with no overlap between them, which allows for fairer evaluation
- Conceptually, the contribution of the current work is limited, as LLM degradation in long-context scenarios has been demonstrated by multiple other synthetic and real-world benchmarks. Some of these works were listed by the authors in Section 5, but several relevant works that require long-context reasoning with aggregation are missing, including BABILong (Kuratov et al., 2024) and InfinityBench (Zhang et al., 2024). Discussing their differences and limitations compared to Oolong is advised.
- If
1. OOLONG frames long-context reasoning as multi-step aggregation: identify relevant spans, classify locally, and pool globally (counts, timelines, user-specific patterns). This is positioned as closer to realistic analytics tasks than classic “needle in a haystack.”
2. OOLONG-synth is controllable: it uses standard classification datasets (spam, sentiment, NLI, etc.) and scales to millions of tokens, letting the authors ablate factors like context length, label access, and time/user structure.
My main issue is novelty. I did not find a clear, convincing argument that OOLONG measures a fundamentally new capability beyond what recent long-context benchmarks and analyses already target.
1. The stated motivation is: most long-context tests are retrieval-style (needle-in-a-haystack, MRCR, etc.), whereas OOLONG requires aggregation/counting over many items. But this does not feel substantially different from what earlier work like BABILong / long multi-step reasoning tasks including object
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
