Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?
Neeladri Bhuiya, Viktor Schlegel, Stefan Winkler

TL;DR
This paper investigates whether large language models genuinely perform multi-hop reasoning or exploit superficial cues, revealing their vulnerabilities to plausible yet incorrect reasoning chains and proposing a challenging benchmark.
Contribution
The study uncovers subtle ways LLMs bypass multi-hop reasoning and introduces a new benchmark with plausible distractors to evaluate their reasoning capabilities.
Findings
LLM performance drops by up to 45% when plausible distractors are present
Models tend to ignore lexical cues but struggle with misleading reasoning paths
Proposed benchmark reveals vulnerabilities in current LLM reasoning abilities
Abstract
State-of-the-art Large Language Models (LLMs) are credited with an increasing number of different capabilities, ranging from reading comprehension and advanced mathematical and reasoning skills to scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given the concerns about simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate whether LLMs are prone to exploiting such cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported about their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a…
Taxonomy
Topics: Topic Modeling
Methods: Sparse Evolutionary Training · Focus
