LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Ziyuan He; Yuxuan Wang; Jiaqi Li; Kexin Liang; Muhan Zhang

arXiv:2510.22548 · cs.CL · October 28, 2025

TL;DR

This paper introduces LooGLE v2, a benchmark for evaluating large language models' ability to understand and process long dependencies in real-world scenarios, revealing significant limitations despite extended context windows.

Contribution

The paper presents a new benchmark with real-world long texts and diverse tasks to assess LLMs' long-context understanding, highlighting their current limitations.

Findings

01 Best models score only 59.2% on the benchmark

02 Popular LLMs understand much shorter contexts than claimed

03 Significant room for improvement in long-context understanding

Abstract

Large language models (LLMs) now come with increasingly long context windows, yet their ability to understand long contexts in long-dependency tasks remains fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that have rarely been benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long-context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, spanning the domains of law, finance, games, and code. Accordingly, we carefully design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances of varying diversity and complexity using a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6…

Code & Models

Datasets

MuLabPKU/LooGLE-v2
dataset· 21 dl