Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries

Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, and Kate Olszewska

arXiv:2409.12640·cs.CL·September 23, 2024

TL;DR

Michelangelo introduces a novel framework for evaluating large language models' ability to understand and manipulate long contexts by revealing latent structures, providing more meaningful diagnostics than traditional retrieval tasks.

Contribution

The paper presents the Latent Structure Queries framework, a new method for creating long-context evaluation tasks that measure deep understanding beyond simple retrieval.

Findings

01. Evaluations are high-signal and effective for assessing long-context understanding.

02. State-of-the-art models show significant room for improvement in long-context reasoning.

03. The framework applies across code and natural language domains.

Abstract

We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We…
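The LSQ idea described in the abstract can be illustrated with a toy example. The sketch below is not the paper's task generator; it is a minimal illustration under stated assumptions. The function names (make_toy_task, score), the list variable a, and the distractor format are all hypothetical. A few operations on a list are scattered among unrelated assignments, so the final list state is the latent structure, and the query asks for that state rather than for any single line of the context.

```python
import random


def make_toy_task(num_relevant=5, num_distractors=20, seed=0):
    """Build a toy latent-structure task (hypothetical, for illustration only).

    The context is a sequence of Python statements. Only statements that
    touch the list `a` matter; the rest are distractors to be ignored.
    """
    rng = random.Random(seed)
    latent = []      # ground-truth state of `a` after all relevant lines
    relevant = []
    for _ in range(num_relevant):
        # Pop only when the list is non-empty, so the context stays valid.
        if latent and rng.random() < 0.3:
            relevant.append("a.pop()")
            latent.pop()
        else:
            value = rng.randint(0, 9)
            relevant.append(f"a.append({value})")
            latent.append(value)

    # Distractors assign to unrelated names and never touch `a`.
    distractors = [f"unused_{i} = {rng.randint(0, 9)}" for i in range(num_distractors)]

    # Interleave while preserving the order of the relevant statements.
    total = num_relevant + num_distractors
    slots = set(rng.sample(range(total), num_relevant))
    rel_iter, dis_iter = iter(relevant), iter(distractors)
    lines = [next(rel_iter) if i in slots else next(dis_iter) for i in range(total)]

    return {
        "context": "\n".join(lines),
        "query": "The list a starts empty. What is a after all statements run?",
        "answer": latent,
    }


def score(prediction, answer):
    """Exact-match scoring: 1.0 if the predicted list equals the answer."""
    return float(prediction == answer)
```

Raising num_distractors lengthens the context without changing the answer, which is one simple way to probe how performance varies with context length under these assumptions.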

Taxonomy

Topics: Semantic Web and Ontologies · Image Retrieval and Classification Techniques · Natural Language Processing Techniques