Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Anushka Yadav; Isha Nalawade; Srujana Pillarichety; Yashwanth Babu; Reshmi Ghosh; Samyadeep Basu; Wenlong Zhao; Ali Nasaeh; Sriram Balasubramanian; Soundararajan Srinivasan

arXiv:2508.04699·cs.CL·August 7, 2025

TL;DR

This paper investigates reasoning failures in contemporary language models during multi-hop question answering, introducing a detailed error framework and providing insights to improve reasoning accuracy and robustness.

Contribution

The paper presents a novel error categorization framework for analyzing reasoning failures and offers insights to enhance model reasoning capabilities.

Findings

01 Identified diverse error patterns in multi-hop reasoning

02 Revealed limitations in source document coverage and overthinking

03 Provided actionable guidance for improving reasoning fidelity

Abstract

The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration…
