Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations
Somnath Banerjee, Pranav Jha, Rima Hazra, Animesh Mukherjee

TL;DR
This paper investigates the trade-off between plausibility and faithfulness in cross-lingual explanations of multilingual large language models, revealing that English explanations often lack causal grounding despite high span agreement.
Contribution
It systematically evaluates the fidelity of English-pivot explanations across multiple tasks, languages, and models, highlighting their limitations and proposing improved auditing practices.
Findings
English explanations achieve higher span agreement but are less causally grounded.
Comprehensiveness of English-pivot explanations degrades by up to 5.7x relative to native-language conditions, even though task accuracy stays stable.
English explanations fail to preserve pragmatic cues in socially nuanced tasks.
Abstract
LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and…
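The abstract names comprehensiveness and sufficiency but this summary does not give their formulas. The sketch below assumes the common erasure-based definitions (comprehensiveness: drop in predicted-class probability when the evidence tokens are removed; sufficiency: drop when only the evidence tokens are kept); the paper's exact formulation may differ. The function names and the toy classifier are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Sequence, Set

# Hypothetical interface: maps a token list to class probabilities.
PredictFn = Callable[[Sequence[str]], Sequence[float]]


def comprehensiveness(predict: PredictFn, tokens: Sequence[str],
                      evidence: Set[int], label: int) -> float:
    """p(label | full input) - p(label | input with evidence removed).

    Higher means the prediction depends more on the cited evidence.
    """
    without_evidence = [t for i, t in enumerate(tokens) if i not in evidence]
    return predict(tokens)[label] - predict(without_evidence)[label]


def sufficiency(predict: PredictFn, tokens: Sequence[str],
                evidence: Set[int], label: int) -> float:
    """p(label | full input) - p(label | evidence only).

    Closer to zero means the evidence alone nearly reproduces the prediction.
    """
    only_evidence = [t for i, t in enumerate(tokens) if i in evidence]
    return predict(tokens)[label] - predict(only_evidence)[label]


def toy_predict(tokens: Sequence[str]) -> Sequence[float]:
    """Toy stand-in for a model (assumption): class 1 is likely iff 'terrible' appears."""
    p_negative = 0.75 if "terrible" in tokens else 0.25
    return [1.0 - p_negative, p_negative]


tokens = ["the", "food", "was", "terrible"]

# Evidence that actually drives the toy prediction.
grounded = {3}
# Evidence that a reader might find plausible but the toy model ignores.
loosely_anchored = {1}

comp_grounded = comprehensiveness(toy_predict, tokens, grounded, label=1)
suff_grounded = sufficiency(toy_predict, tokens, grounded, label=1)
comp_loose = comprehensiveness(toy_predict, tokens, loosely_anchored, label=1)
suff_loose = sufficiency(toy_predict, tokens, loosely_anchored, label=1)
```

In this toy setup the grounded span has high comprehensiveness and a sufficiency gap of zero, while the loosely anchored span has zero comprehensiveness and a large sufficiency gap. That is the kind of contrast the abstract describes for English-pivot explanations: evidence can look reasonable without being what the prediction depends on.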
