Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps
Yijiong Yu, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, Ran Chen, Ji Pei

TL;DR
Long-context language models are nearly perfect at standard retrieval benchmarks but fail some basic retrieval tasks unless they are given explicit reasoning steps, which highlights the importance of long chain-of-thought (CoT) prompts for these tasks.
Contribution
This paper reveals the limitations of long-context language models in basic retrieval tasks and demonstrates the effectiveness of long-chain reasoning prompts to overcome these issues.
Findings
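Long-context models fail some basic retrieval tasks when they are not given sufficient reasoning steps.
Adding reasoning steps, guided by specific CoT prompts, largely resolves these failures.
Long-CoT methods may be necessary for solving certain long-context tasks, which earlier benchmarks treat as direct QA.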
Abstract
Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, although they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate that they fail in some basic cases. We further find that these failures can be largely addressed with a sufficient number of reasoning steps, guided by specific chain-of-thought (CoT) prompts. This result emphasizes the potential necessity of solving specific long-context tasks with long-CoT methods, whereas previous long-context benchmarks always ignore the need for long reasoning in long-context tasks and treat them as direct QA tasks.
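The contrast the abstract draws between direct QA and CoT-guided prompting can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names (`build_haystack`, `direct_qa_prompt`, `long_cot_prompt`), the synthetic key-value task, and the prompt wording are hypothetical and are not the tasks or prompts used in the paper. The sketch only builds the two prompt styles over the same long context; it does not call any model.

```python
import random


def build_haystack(needles, n_filler=200, seed=0):
    """Scatter key-value 'needle' sentences among filler lines (toy task, not the paper's)."""
    rng = random.Random(seed)
    lines = [f"Filler sentence number {i}." for i in range(n_filler)]
    for key, value in needles.items():
        # Insert each needle at a random position in the growing context.
        pos = rng.randrange(len(lines) + 1)
        lines.insert(pos, f"The value of {key} is {value}.")
    return "\n".join(lines)


def direct_qa_prompt(context, question):
    """Direct QA style: ask for the answer with no intermediate reasoning."""
    return f"{context}\n\nQuestion: {question}\nReply with the final answer only."


def long_cot_prompt(context, question):
    """Long-CoT style (hypothetical wording): ask for explicit steps before the answer."""
    return (
        f"{context}\n\nQuestion: {question}\n"
        "Work step by step. First quote every sentence in the context that is relevant. "
        "Then reason over the quoted sentences one at a time. "
        "Finally, state the result on a new line starting with 'Answer:'."
    )


if __name__ == "__main__":
    context = build_haystack({"alpha": 17, "beta": 42})
    question = "Which key has the larger value, alpha or beta?"
    print(direct_qa_prompt(context, question)[-120:])
    print(long_cot_prompt(context, question)[-300:])
```

Both prompts share the same context and question, so any difference in a model's accuracy between them could be attributed to the number of reasoning steps the prompt elicits rather than to the retrieval content itself.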
Taxonomy
Topics: Simulation Techniques and Applications
