Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps

Yijiong Yu; Yongfeng Huang; Zhixiao Qi; Wei Wang; Weifeng Liu; Ran Chen; Ji Pei

arXiv:2410.04422·cs.CL·August 27, 2025

TL;DR

Long-context language models are nearly perfect on standard long-context retrieval tasks but fail in some basic retrieval cases unless they are prompted to reason step by step. CoT prompts that elicit a sufficient number of reasoning steps largely resolve these failures.

Contribution

This paper shows that long-context language models fail in some basic retrieval cases despite near-perfect results on standard retrieval tasks, and that specific CoT prompts eliciting a sufficient number of reasoning steps can address these failures.

Findings

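01. Models fail in some basic retrieval tasks when they are not given sufficient reasoning steps.

02. Adding reasoning steps through specific CoT prompts substantially improves performance on these tasks.

03. Long-CoT methods may be necessary for solving specific long-context tasks, which previous benchmarks treat as direct QA tasks.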

Abstract

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, although they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate that they fail in some basic cases. We further find that these failures can be well addressed with a sufficient number of reasoning steps, guided by specific chain-of-thought (CoT) prompts. This result emphasizes the potential necessity of solving specific long-context tasks with long-CoT methods, whereas previous long-context benchmarks ignore the need for long reasoning in long-context tasks and treat them as direct QA tasks.
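To make the contrast between direct QA and step-by-step prompting concrete, here is a minimal Python sketch. It is not taken from the paper or its repository: the synthetic key-value context, the example question, and the helper names `build_context`, `direct_prompt`, `stepwise_prompt`, and `gold_answer` are all assumptions made for illustration, and the prompt wording is only one plausible way to ask for more reasoning steps.

```python
import random


def build_context(num_items: int, seed: int = 0) -> tuple[dict[str, int], str]:
    """Create a synthetic key-value context (hypothetical, not the paper's data)."""
    rng = random.Random(seed)
    items = {f"key-{i:04d}": rng.randint(0, 9999) for i in range(num_items)}
    context = "\n".join(f"{key}: {value}" for key, value in items.items())
    return items, context


def direct_prompt(context: str, question: str) -> str:
    """Direct-QA style: ask for the final answer with no intermediate steps."""
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Reply with only the final answer."
    )


def stepwise_prompt(context: str, question: str) -> str:
    """CoT style: ask the model to examine every item before answering."""
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Go through the items one by one, state for each item whether it "
        "matches, and only then give the final answer."
    )


def gold_answer(items: dict[str, int], threshold: int) -> list[str]:
    """Reference answer: all keys whose value exceeds the threshold."""
    return sorted(key for key, value in items.items() if value > threshold)


if __name__ == "__main__":
    items, context = build_context(num_items=200)
    question = "Which keys have a value greater than 9000?"
    # Either prompt string would be sent to a model of your choice;
    # the model call itself is deliberately left out of this sketch.
    print(len(direct_prompt(context, question)))
    print(len(stepwise_prompt(context, question)))
    print(gold_answer(items, threshold=9000))
```

Both prompts share the same context and question, so any difference in accuracy against `gold_answer` would come from the amount of reasoning the model is asked to produce.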

Code & Models

Repositories

yuyijiong/hard_retrieval_for_llm
PyTorch · Official

Datasets

yuyijiong/difficult_retrieval
dataset · 27 downloads

Taxonomy

Topics: Simulation Techniques and Applications