Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks
Zi Yang

TL;DR
This paper introduces the Dolce framework to categorize and measure the difficulty of long context understanding tasks in language models, distinguishing between retrieval and holistic understanding capabilities.
Contribution
The paper proposes a novel parameterization and sampling method to automatically identify focus categories and difficulty levels in long context tasks.
Findings
0% to 67% of a task's problems are retrieval focused, depending on the task
0% to 90% of a task's problems are holistic understanding focused, depending on the task
Both ranges come from applying Dolce to 44 existing long context evaluation tasks
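The percentage ranges above are easiest to read as the smallest and largest per-task share of problems in a focus category, taken over the 44 tasks. Assuming that reading, the short sketch below computes such a range from per-problem labels. The function name `share_range`, the task names, and the labels ("retrieval", "holistic", "other") are hypothetical placeholders, not the paper's category names or data.

```python
from typing import Dict, List, Tuple


def share_range(labels_by_task: Dict[str, List[str]], category: str) -> Tuple[float, float]:
    """Return the (min, max) share of a task's problems labelled `category`.

    Each task contributes one share: (#problems with that label) / (#problems).
    Tasks with no problems are skipped; raises ValueError if none remain.
    """
    shares = [labels.count(category) / len(labels)
              for labels in labels_by_task.values() if labels]
    if not shares:
        raise ValueError("no non-empty tasks")
    return min(shares), max(shares)


# Hypothetical toy labels for three tasks (not data from the paper).
labels = {
    "task_a": ["retrieval", "retrieval", "holistic"],
    "task_b": ["holistic", "holistic", "holistic", "other"],
    "task_c": ["other", "retrieval"],
}
print(share_range(labels, "retrieval"))  # smallest and largest retrieval share
print(share_range(labels, "holistic"))
```

A range that starts at 0% simply means at least one task contains no problems of that focus.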
Abstract
We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs' long context capabilities would not be possible without knowing the tasks' focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by λ (complexity) and k (redundancy) and assigns it to one of five predefined focus categories. We propose to sample short contexts from the full context and estimate the probability that an LLM solves the problem using the sampled spans. To find λ and k for each problem, we further propose a mixture model of a non-parametric background noise component and a…
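To make the sampling-and-fitting idea concrete, here is a toy sketch. It is not the paper's derivation: the paper derives its own probability functions, which are cut off in the abstract above. The sketch assumes a problem is solvable from a span if and only if the span fully contains at least one of k independent, uniformly placed evidence windows, each λ units wide. Here λ stands in for complexity and k for redundancy. The non-parametric background noise is replaced by a constant guess rate. All names (`sample_span`, `collect_observations`, `oracle_prob`, `mixture_prob`, `fit_lam_k`) and the `solver` callback are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical solver: answers from a span and reports whether it was correct
# (in practice this would wrap an LLM call plus grading).
Solver = Callable[[Sequence[str]], bool]


def sample_span(context: Sequence[str], span_len: int, rng: random.Random) -> Sequence[str]:
    """Draw a contiguous span of `span_len` units uniformly from `context`."""
    if not 1 <= span_len <= len(context):
        raise ValueError("span_len must be between 1 and len(context)")
    start = rng.randrange(len(context) - span_len + 1)
    return context[start:start + span_len]


def collect_observations(context: Sequence[str], span_lens: List[int], solver: Solver,
                         trials: int, rng: random.Random) -> Dict[int, Tuple[int, int]]:
    """For each span length, record (successes, trials) over sampled spans."""
    obs = {}
    for span_len in span_lens:
        hits = sum(solver(sample_span(context, span_len, rng)) for _ in range(trials))
        obs[span_len] = (hits, trials)
    return obs


def oracle_prob(span_len: int, n: int, lam: int, k: int) -> float:
    """Toy oracle: P(span contains at least one of k evidence windows of width lam).

    Assumes each window start is uniform over the n - lam + 1 valid positions,
    so one window fits inside the span with probability
    (span_len - lam + 1) / (n - lam + 1) when span_len >= lam.
    """
    if lam > n:
        return 0.0
    p_one = min(1.0, max(0, span_len - lam + 1) / (n - lam + 1))
    return 1.0 - (1.0 - p_one) ** k


def mixture_prob(span_len: int, n: int, lam: int, k: int,
                 noise_weight: float, guess_rate: float) -> float:
    """Mix a constant background guess rate with the toy oracle component."""
    return noise_weight * guess_rate + (1.0 - noise_weight) * oracle_prob(span_len, n, lam, k)


def fit_lam_k(obs: Dict[int, Tuple[int, int]], n: int, max_k: int,
              noise_weight: float = 0.0, guess_rate: float = 0.0,
              eps: float = 1e-9) -> Tuple[int, int]:
    """Grid-search the (lam, k) pair maximizing the binomial log-likelihood."""
    best_ll, best = -math.inf, (1, 1)
    for lam in range(1, n + 1):
        for k in range(1, max_k + 1):
            ll = 0.0
            for span_len, (hits, trials) in obs.items():
                p = mixture_prob(span_len, n, lam, k, noise_weight, guess_rate)
                p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
                ll += hits * math.log(p) + (trials - hits) * math.log(1.0 - p)
            if ll > best_ll:
                best_ll, best = ll, (lam, k)
    return best


# Synthetic counts that match a lam=3, k=1 problem in a 10-unit context exactly.
synthetic = {l: (max(0, l - 2), 8) for l in range(1, 11)}
print(fit_lam_k(synthetic, n=10, max_k=3))
```

Under this toy model, a small fitted λ behaves like a retrieval problem (a short span suffices), while a λ close to the context length behaves like a holistic one. A real pipeline would feed `collect_observations` output from an LLM-backed solver into the fit instead of synthetic counts.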
Taxonomy
Topics: Evaluation and Performance Assessment
Methods: Focus
