TL;DR
This paper investigates whether large language models' reasoning abilities transfer across languages. It introduces a human-validated evaluation framework and finds that reasoning traces can fail to support their conclusions, especially in non-Latin scripts.
Contribution
It presents a human-validated framework for assessing reasoning transfer across languages and uncovers prevalent evidential and logical errors in multilingual model reasoning.
Findings
Models achieve high task accuracy, yet their reasoning traces can fail to support their conclusions.
Reasoning traces in non-Latin scripts show at least twice as much reasoning-conclusion misalignment as those in Latin scripts.
Evidential errors are the primary cause of reasoning failures.
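The script-level comparison above can be pictured as a per-group misalignment rate. The following is a minimal sketch, not the paper's pipeline: the record schema, the function name `misalignment_rates`, the group labels, and the toy labels are all assumptions made for illustration.

```python
from collections import defaultdict


def misalignment_rates(records):
    """Return the fraction of misaligned traces per script group.

    Each record is a (script_group, supports_conclusion) pair, where
    supports_conclusion is a judgment of whether the reasoning trace
    supports its final answer. This schema is hypothetical.
    """
    totals = defaultdict(int)
    misaligned = defaultdict(int)
    for script_group, supports_conclusion in records:
        totals[script_group] += 1
        if not supports_conclusion:
            misaligned[script_group] += 1
    return {group: misaligned[group] / totals[group] for group in totals}


# Toy labels invented for illustration only; these are not data from the paper.
toy_records = (
    [("latin", True)] * 9 + [("latin", False)] * 1
    + [("non_latin", True)] * 8 + [("non_latin", False)] * 2
)

rates = misalignment_rates(toy_records)
# Ratio of non-Latin to Latin misalignment on the toy labels.
ratio = rates["non_latin"] / rates["latin"]
print(rates, ratio)
```

Note that this rate is computed independently of task accuracy, which is why a model can score well on answers while still showing misaligned reasoning.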
Abstract
Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by…