Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
Jonathan Roberts, Kai Han, Samuel Albanie

TL;DR
This paper evaluates 17 leading LLMs on their ability to follow multiple threads of information through long contexts. It finds that many models can follow several threads concurrently, but that effective context length is often shorter than the supported maximum and that token counts differ substantially between tokenizers.
Contribution
It provides a comprehensive empirical analysis of LLMs' thread-following capabilities in long contexts, highlighting their multitasking strengths and context length limitations.
Findings
Many models are capable of following multiple threads simultaneously.
Effective context length is often shorter than the maximum supported context.
Tokenizer differences significantly affect token count and model performance.
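To make the idea of "following a thread" concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the haystack is a flat collection of key-value pairs with random UUID-style identifiers, and that a thread is a chain in which each value is the key of the next pair. The function names, sizes, and seed are hypothetical and are not taken from the paper's code. The sketch only computes the ground-truth answer; in an evaluation of this kind, the haystack and the starting key would be placed in the prompt and the model's reply compared against that answer.

```python
import random
import uuid


def build_haystack(num_pairs, thread_length, seed=0):
    """Build a shuffled key-value haystack containing one hidden thread.

    Returns (haystack, start_key, final_value). All names and the layout
    are illustrative assumptions, not the paper's implementation.
    """
    if thread_length > num_pairs:
        raise ValueError("thread_length cannot exceed num_pairs")
    rng = random.Random(seed)

    def new_id():
        # Deterministic UUID-formatted string drawn from the seeded generator.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    # The thread: each value is the key of the next pair in the chain.
    chain = [new_id() for _ in range(thread_length + 1)]
    pairs = [(chain[i], chain[i + 1]) for i in range(thread_length)]

    # Distractor pairs: their values do not point at any other key.
    while len(pairs) < num_pairs:
        pairs.append((new_id(), new_id()))

    # Shuffle so the thread's links are scattered through the haystack.
    rng.shuffle(pairs)
    return dict(pairs), chain[0], chain[-1]


def follow_thread(haystack, start_key):
    """Follow key -> value links from start_key until the value is not a key."""
    current = start_key
    seen = {current}
    while current in haystack:
        current = haystack[current]
        if current in seen:  # guard against accidental cycles
            break
        seen.add(current)
    return current


haystack, start, expected = build_haystack(num_pairs=1000, thread_length=5)
answer = follow_thread(haystack, start)
```

Under the same assumptions, following several threads at once would amount to calling `follow_thread` from several starting keys in the same haystack and checking every final value.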
Abstract
As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of…
Taxonomy
Topics: Algorithms and Data Compression · semigroups and automata theory
Methods: Sparse Evolutionary Training
