SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels
Guancheng Du, Yong Hu, Wenqing Wang, Yaming Yang, Jiaheng Gao

TL;DR
SagaScale introduces a large, realistic, bilingual long-context benchmark from full-length novels, enabling better evaluation of LLMs' ability to handle extensive, complex documents with high data quality and scalability.
Contribution
The paper presents SagaScale, a scalable, high-quality long-context benchmark built from full-length novels, with an automated data pipeline and bilingual support, surpassing existing benchmarks in size and realism.
Findings
Full context input improves LLM performance significantly.
Gemini-2.5-Pro handles lengthy contexts better than the other evaluated models.
Agentic RAG mitigates retrieval bottleneck effectively.
Abstract
Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K…
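The separation the abstract describes, with external resources visible while questions are written but withheld when models are tested, can be illustrated with a minimal sketch. Everything here is a hypothetical illustration rather than the authors' pipeline: the `QAPair` type, the two prompt-building functions, the prompt wording, and the toy novel and notes are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """A curated question with its reference answer (hypothetical type)."""
    question: str
    answer: str


def build_generation_prompt(novel_text: str, external_resource: str) -> str:
    """Construction time: the question writer sees the novel AND the external resource."""
    return (
        "Write a question-answer pair about the novel, using the reference notes.\n\n"
        f"[NOVEL]\n{novel_text}\n\n"
        f"[REFERENCE NOTES]\n{external_resource}\n"
    )


def build_evaluation_prompt(novel_text: str, qa: QAPair) -> str:
    """Evaluation time: the model under test sees only the novel and the question."""
    return f"[NOVEL]\n{novel_text}\n\n[QUESTION]\n{qa.question}\n"


# Toy stand-ins for a full-length novel and a Wikipedia-style page (assumed data).
novel = "Chapter 1. Ada leaves the harbor town. Chapter 2. Ada sails home again."
wiki_notes = "Plot summary: Ada returns to the harbor town in the final chapter."
qa = QAPair(
    question="Where does Ada end up at the end of the story?",
    answer="The harbor town",
)

gen_prompt = build_generation_prompt(novel, wiki_notes)
eval_prompt = build_evaluation_prompt(novel, qa)
```

In this sketch the reference notes appear in `gen_prompt` but never in `eval_prompt`, so a question can rely on information that the evaluated model must recover from the novel alone.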
Peer Reviews
Decision: Submitted to ICLR 2026
- The benchmark offers realistic, high-quality QA tasks.
- Presents a rigorous, multi-stage QA generation and filtering pipeline to ensure quality.
- Comprehensive evaluation including native RAG, agentic RAG, and long-context processing.
- Clear and well-structured presentation facilitates understanding.
These elements are crucial, as they significantly enhance the benchmark's realism and practical utility, enabling more effective evaluation of long-context language models, which are highly re
- Lack of transparency regarding the total dataset construction cost undermines one of the claimed benefits: cost efficiency compared to human annotation. There should be a more detailed comparison with previous benchmarks on cost and scalability.
- Insufficient analysis of QA types: It remains unclear what capabilities models need to correctly answer questions—whether questions typically require retrieving information from a single segment, integrating multiple segments, or performing mult
1. This benchmark includes unprecedented context lengths, bilingual support, and a scalable, automated pipeline for generating high-quality question-answer pairs.
2. Beyond just presenting a new dataset, the paper provides a rigorous evaluation of a wide range of state-of-the-art LLMs and three distinct long-context methods. This analysis establishes valuable baselines and offers crucial insights into the current landscape.
1. A key limitation is the narrow scope of the dataset, which consists solely of fictional novels. This domain is not representative of common long-context tasks in practice, which raises concerns about the practical relevance and generalizability of the evaluation.
2. During the QA generation phase, all question-answer pairs are generated by a single model, DeepSeek-R1. This could lead to a lack of diversity in the QAs and potentially create a bias that favors DeepSeek-R1's own evaluation. An i
**1. The benchmark provides genuinely ultra-long context evaluation.** With English novels averaging over 250K tokens and Chinese novels exceeding 320K tokens—some reaching 800K+ tokens—SagaScale represents one of the longest context benchmarks currently available. This is a meaningful contribution given the rapid advancement of long-context modeling capabilities in modern LLMs, and the bilingual coverage across English and Chinese adds valuable cross-lingual evaluation capacity that remains rel
**1. The literature review is severely inadequate, undermining the paper's fundamental motivation.** The authors claim existing benchmarks suffer from "insufficient task realism," "limited data scalability," and "data quality issues," but these assertions lack substantive support and mischaracterize the current state of the field. HELMET [1] already provides a comprehensive evaluation across seven diverse, application-centered categories, including multi-hop reasoning, temporal reasoning, and en
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Text Readability and Simplification
