Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Sohee Yang; Nora Kassner; Elena Gribovskaya; Sebastian Riedel; Mor Geva

arXiv:2411.16679·cs.CL·June 3, 2025

TL;DR

This paper introduces SOCRATES, a dataset for evaluating whether large language models can perform latent multi-hop reasoning without exploiting shortcuts. Models do well on country-related queries but much worse on year-based queries.

Contribution

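The paper presents a dataset and evaluation methodology that isolates latent reasoning in LLMs by excluding test queries whose head and answer entities might have co-appeared during training, and uses it to show where latent multi-hop reasoning succeeds and where it falls short.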

Findings

01. LLMs show strong latent reasoning for country-related queries (80%).
02. Performance drops sharply for year-based queries (5%).
03. Latent reasoning abilities emerge during pretraining.

Abstract

We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like "In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of". One major challenge in such evaluation is that LLMs may have developed shortcuts by encountering the head entity "Scarlett Johansson" and the answer entity "United States" in the same training sequences or merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities might have co-appeared during training. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities…
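
The shortcut-exclusion step described above can be illustrated with a minimal sketch. It assumes a toy in-memory corpus and plain case-insensitive substring matching; the names `MultiHopQuery`, `co_appear`, and `filter_shortcut_free` are hypothetical and are not taken from the paper, whose actual co-occurrence check may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MultiHopQuery:
    """Hypothetical container for one two-hop test query."""
    prompt: str
    head_entity: str
    answer_entity: str


def co_appear(head: str, answer: str, documents: Iterable[str]) -> bool:
    """Return True if any single document mentions both entities.

    Assumption: case-insensitive substring matching stands in for
    whatever entity matching a real pipeline would use.
    """
    head_l, answer_l = head.lower(), answer.lower()
    for doc in documents:
        doc_l = doc.lower()
        if head_l in doc_l and answer_l in doc_l:
            return True
    return False


def filter_shortcut_free(
    queries: Iterable[MultiHopQuery], documents: List[str]
) -> List[MultiHopQuery]:
    """Keep only queries whose head and answer never share a document."""
    return [
        q for q in queries
        if not co_appear(q.head_entity, q.answer_entity, documents)
    ]


# Toy stand-in for training sequences (illustrative only).
toy_corpus = [
    "Scarlett Johansson is an actress from the United States.",
    "An unrelated sentence that mentions Entity A only.",
]

toy_queries = [
    MultiHopQuery(
        prompt="In the year Scarlett Johansson was born, the Summer "
               "Olympics were hosted in the country of",
        head_entity="Scarlett Johansson",
        answer_entity="United States",
    ),
    # Placeholder query whose entities never co-occur in the toy corpus.
    MultiHopQuery(
        prompt="placeholder two-hop prompt",
        head_entity="Entity A",
        answer_entity="Entity B",
    ),
]

kept = filter_shortcut_free(toy_queries, toy_corpus)
```

In this toy setup the Scarlett Johansson query is dropped because its head and answer entities share a document, while the placeholder query is kept. Note that the filter is deliberately conservative: it removes any query where a shortcut is possible, not only those where one is known to have been learned.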

Code & Models

Datasets

soheeyang/SOCRATES
dataset · 17 downloads

Videos

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems