Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle; Candace Ross; Sebastian Ruder; Adina Williams; Karen Ullrich; Mark Ibrahim; Levent Sagun

arXiv:2512.22712 · cs.CL · March 31, 2026

TL;DR

This paper investigates whether the reasoning quality of large language models transfers across languages. It finds that reasoning traces frequently fail to support their conclusions, especially in non-Latin scripts, and introduces a human-validated framework for evaluating whether a trace logically supports its answer.

Contribution

It presents a human-validated framework for assessing reasoning transfer across languages and uncovers prevalent evidential and logical errors in multilingual model reasoning.

Findings

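01. Models achieve high task accuracy, but their reasoning can fail to support their conclusions.

02. Reasoning traces in non-Latin scripts show at least twice as much reasoning-conclusion misalignment as those in Latin scripts.

03. Evidential errors (unsupported claims, ambiguous facts) are the primary cause of reasoning failures.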

Abstract

Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by…
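To make the distinction between task accuracy and reasoning-answer alignment concrete, the following is a minimal sketch of how the two quantities could be computed separately and compared across script groups. It is not the authors' pipeline: the `Trace` record, its field names, the `latin` / `non_latin` labels, and the language codes are all hypothetical, and the toy records are invented purely to exercise the functions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    # Hypothetical record layout; field names are assumptions, not the paper's schema.
    language: str   # illustrative language code
    script: str     # "latin" or "non_latin"
    correct: bool   # final answer matches the gold label
    aligned: bool   # a judge found that the reasoning supports the final answer


def misalignment_rates(traces):
    """Fraction of traces per script whose reasoning does not support the conclusion."""
    totals = defaultdict(int)
    misaligned = defaultdict(int)
    for t in traces:
        totals[t.script] += 1
        if not t.aligned:
            misaligned[t.script] += 1
    return {script: misaligned[script] / totals[script] for script in totals}


def accuracy(traces):
    """Fraction of traces whose final answer is correct, ignoring the reasoning."""
    return sum(t.correct for t in traces) / len(traces)


# Toy, made-up records for illustration only; these are not data from the paper.
toy = [
    Trace("en", "latin", True, True),
    Trace("en", "latin", True, True),
    Trace("es", "latin", True, True),
    Trace("es", "latin", True, False),
    Trace("hi", "non_latin", True, False),
    Trace("hi", "non_latin", True, True),
    Trace("ar", "non_latin", True, False),
    Trace("ar", "non_latin", False, True),
]

rates = misalignment_rates(toy)
print("accuracy:", accuracy(toy))
print("misalignment by script:", rates)
```

Keeping `correct` and `aligned` as separate fields is the point of the sketch: a trace can be counted as accurate while still being flagged as misaligned, which is the blind spot the abstract describes.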
