Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
Eric H. C. Chow

TL;DR
This study evaluates five frontier large language models on retrieval and multi-hop reasoning over up to 1 million tokens of classical Chinese text. Single-needle retrieval is essentially solved for the strongest models, while multi-hop reasoning accuracy varies by model and degrades at the longest context lengths.
Contribution
It provides the first comprehensive assessment of LLMs' long-context retrieval and multi-hop reasoning capabilities at 1M tokens, identifying distinct degradation regimes.
Findings
The strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5) each achieve 100% single-needle retrieval accuracy at 1M tokens.
Multi-hop reasoning performance varies, with some models maintaining >80% accuracy up to 512K tokens.
Beyond 512K tokens, multi-hop accuracy either drops sharply or degrades gradually, depending on the model.
Abstract
We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models - Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100% - but…
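The two test designs described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's harness: the depth fractions, the needle placeholders, and the helper names `insert_needle`, `build_trials`, and `follow_chain` are all hypothetical, since the abstract states only that there are three needles, three depths, real and altered variants, and three-hop chains.

```python
from itertools import product

# Hypothetical depth fractions: the abstract says "three depths" without listing them.
DEPTHS = (0.1, 0.5, 0.9)

# Hypothetical placeholders for the three biographical needles. "real" agrees with
# what a model may have memorised; "altered" contradicts it, so a correct answer
# on the altered variant can only come from the context.
NEEDLES = {
    "needle_a": {"real": "REAL FACT A", "altered": "ALTERED FACT A"},
    "needle_b": {"real": "REAL FACT B", "altered": "ALTERED FACT B"},
    "needle_c": {"real": "REAL FACT C", "altered": "ALTERED FACT C"},
}


def insert_needle(haystack: list[str], needle: str, depth: float) -> list[str]:
    """Return a copy of `haystack` with `needle` placed at a fractional depth."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")
    pos = round(depth * len(haystack))
    return haystack[:pos] + [needle] + haystack[pos:]


def build_trials(haystack: list[str]) -> list[dict]:
    """Enumerate every needle x depth x variant combination (Test 1 style)."""
    trials = []
    for (name, variants), depth in product(NEEDLES.items(), DEPTHS):
        for variant, text in variants.items():
            trials.append({
                "needle": name,
                "variant": variant,
                "depth": depth,
                "context": insert_needle(haystack, text, depth),
            })
    return trials


def follow_chain(links: dict[str, str], start: str, hops: int = 3) -> str:
    """Compute the expected answer of a multi-hop chain (Test 2 style)."""
    node = start
    for _ in range(hops):
        node = links[node]
    return node
```

Pairing each needle with an altered variant is what lets the grader tell in-context retrieval apart from recall of training data. `follow_chain` only derives the ground-truth answer for a chain; the model under test would have to locate each link in the context itself.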
