Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering

Michal Štefánik; Timothee Mickus; Marek Kadlčík; Michal Spiegel; Josef Kuchař

arXiv:2508.18407 · cs.CL · August 27, 2025

TL;DR

This paper tests whether out-of-distribution (OOD) evaluations in question answering actually detect models' reliance on prediction shortcuts. It finds that OOD datasets differ widely in how well they do so, with some less informative than a simple in-distribution evaluation, and suggests more robust evaluation methods.

Contribution

It challenges the assumption that OOD evaluations reliably reflect real-world model failures and proposes improved methodologies for assessing generalization in QA models.

Findings

01

OOD datasets vary greatly in their ability to detect shortcut reliance

02

Shared spurious features across datasets can mislead OOD evaluations

03

Some OOD evaluations estimate robustness to shortcuts worse than a simple in-distribution evaluation

Abstract

A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models' robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared…
