Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Jonathan Roberts; Kai Han; Samuel Albanie

arXiv:2411.05000 · cs.CL · April 24, 2025

TL;DR

This paper evaluates 17 leading LLMs on their ability to follow multiple threads of information through long contexts. Many models can follow several threads at once, but their effective context length is often shorter than the supported maximum, and token counts differ substantially between tokenizers.

Contribution

It provides a comprehensive empirical analysis of LLMs' thread-following capabilities in long contexts, highlighting their multitasking strengths and context length limitations.

Findings

01. Many models are capable of following multiple threads simultaneously.

02. Effective context length is often shorter than the maximum supported context.

03. Tokenizer differences significantly affect token count and model performance.
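
To make the third finding concrete, the sketch below counts tokens for the same string under two toy tokenizers. Both are hypothetical stand-ins that split text into fixed-size character chunks; real LLM tokenizers use learned subword vocabularies, and the sample string is an arbitrary UUID-style identifier chosen for illustration, not data from the paper.

```python
def chunk_tokenize(text, chunk_size):
    """Toy tokenizer (an assumption, not a real LLM tokenizer):
    split text into consecutive chunks of chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Arbitrary UUID-style string, 36 characters long.
sample = "123e4567-e89b-12d3-a456-426614174000"

coarse = chunk_tokenize(sample, 4)  # larger chunks, fewer tokens
fine = chunk_tokenize(sample, 2)    # smaller chunks, more tokens

print(len(coarse), len(fine))  # prints: 9 18
```

The same text costs twice as many tokens under the finer tokenizer, so a context limit stated in tokens does not correspond to the same amount of text for every model.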

Abstract

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of…
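
As a rough illustration of what following a thread through a haystack can mean, the sketch below builds a list of key-value pairs in which a few pairs form a chain: the value of one pair is the key of the next. The UUID-style keys, the function names `build_haystack` and `follow_thread`, and the forward-only ordering are assumptions made for this example; this is not the authors' benchmark code.

```python
import random
import uuid


def build_haystack(num_pairs, thread_length, seed=0):
    """Return (pairs, start_key, final_value) for one forward thread.

    Hypothetical setup: thread pairs are scattered among distractor
    pairs, keeping their relative order so the thread runs forward.
    """
    rng = random.Random(seed)

    def new_id():
        # Seeded 128-bit identifier formatted like a UUID.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    # Chain of ids: each thread pair maps chain[i] -> chain[i + 1].
    chain = [new_id() for _ in range(thread_length + 1)]
    thread_pairs = iter(zip(chain[:-1], chain[1:]))
    distractors = iter(
        (new_id(), new_id()) for _ in range(num_pairs - thread_length)
    )

    # Positions in the haystack reserved for thread pairs.
    thread_slots = set(rng.sample(range(num_pairs), thread_length))
    pairs = [
        next(thread_pairs) if i in thread_slots else next(distractors)
        for i in range(num_pairs)
    ]
    return pairs, chain[0], chain[-1]


def follow_thread(pairs, start_key):
    """Reference answer: hop from key to value until no key matches."""
    lookup = dict(pairs)
    key = start_key
    while key in lookup:
        key = lookup[key]
    return key


pairs, start_key, final_value = build_haystack(num_pairs=50, thread_length=5)
print(follow_thread(pairs, start_key) == final_value)  # prints: True
```

In an evaluation along these lines, the pairs would be serialised into the prompt, the model would be asked for the value at the end of the thread starting at `start_key`, and its answer would be compared against the reference that `follow_thread` computes.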

Code & Models

Datasets

jonathan-roberts1/needle-threading
dataset · 39 dl

Videos

Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks? · SlidesLive

Taxonomy

Topics: Algorithms and Data Compression · semigroups and automata theory

Methods: Sparse Evolutionary Training