Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

Eric H. C. Chow

arXiv:2605.02173·cs.AI·May 5, 2026

TL;DR

This study evaluates five frontier large language models on retrieval and multi-hop reasoning over classical Chinese text at context lengths up to 1M tokens. Single-needle retrieval is essentially solved for the strongest models, while multi-hop reasoning degrades in model-dependent ways at extreme context lengths.

Contribution

The paper provides the first comprehensive assessment of LLMs' long-context retrieval and multi-hop reasoning capabilities at 1M tokens, and identifies distinct degradation regimes: a sharp decline versus a gradual degradation beyond 512K tokens.

Findings

01

The strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5) each achieve 100% single-needle retrieval accuracy at 1M tokens.

02

Multi-hop reasoning performance varies, with some models maintaining >80% accuracy up to 512K tokens.

03

Beyond 512K tokens, performance either declines sharply or degrades gradually, depending on the model.

Abstract

We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models - Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100% - but…
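The Test 1 design described above (needles planted at several depths, each in a real and an altered variant) can be illustrated with a minimal Python sketch. This is not the author's harness: the depth values, the needle names and texts, the filler haystack, and the choice to cross every needle with every depth are all assumptions made for illustration, and depth is measured in characters rather than tokens for simplicity.

```python
from itertools import product

# Hypothetical fractional depths; the abstract says "three depths"
# but does not state their values.
DEPTHS = (0.1, 0.5, 0.9)
VARIANTS = ("real", "altered")


def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional character offset of `haystack`."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]


def build_trials(haystack: str, needles: dict) -> list:
    """Build one trial per (needle, depth, variant) combination.

    `needles` maps a needle name to {"real": text, "altered": text}.
    Crossing every needle with every depth is an assumption here.
    """
    trials = []
    for name, depth, variant in product(needles, DEPTHS, VARIANTS):
        needle_text = needles[name][variant]
        trials.append({
            "needle": name,
            "depth": depth,
            "variant": variant,
            "needle_text": needle_text,
            "context": plant_needle(haystack, needle_text, depth),
        })
    return trials


# Placeholder data only: not taken from the paper's corpus.
HAYSTACK = "x" * 1000
NEEDLES = {
    "needle_a": {"real": "[A: real fact]", "altered": "[A: altered fact]"},
    "needle_b": {"real": "[B: real fact]", "altered": "[B: altered fact]"},
    "needle_c": {"real": "[C: real fact]", "altered": "[C: altered fact]"},
}
TRIALS = build_trials(HAYSTACK, NEEDLES)
```

Under these assumptions the grid has 3 needles × 3 depths × 2 variants = 18 trials. The point of pairing variants is that a model answering correctly on the altered variant must be reading the context, since its training data would suggest the real answer instead.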
