Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann

TL;DR
This paper introduces Macro, a preference alignment framework that uses Direct Preference Optimization (DPO) to improve the validity and minimality of multilingual counterfactual explanations generated by large language models.
Contribution
Macro applies preference optimization to multilingual SCE generation, effectively balancing validity and minimality across diverse languages.
Findings
Macro improves validity by 12.55% on average over the chain-of-thought baseline.
Macro outperforms the translation-based baseline in both minimality and validity.
Preference optimization enhances cross-lingual perturbation alignment and reduces errors.
Abstract
Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the…
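The abstract does not spell out the composite scoring function, so the following is only a minimal sketch of how a validity/minimality trade-off could be turned into DPO preference pairs. Everything named here is an assumption for illustration and not the paper's method: the `Candidate` type, the weight `alpha`, the use of normalized token-level edit distance as the minimality term, and the choice of the highest- and lowest-scoring candidates as the chosen and rejected responses. Only `dpo_loss` follows a known form, the standard per-pair DPO objective.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str      # candidate counterfactual (hypothetical container type)
    flipped: bool  # True if the model's own prediction changed (validity)


def token_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def composite_score(original, cand, alpha=0.5):
    """Assumed score: alpha * validity + (1 - alpha) * minimality."""
    src, tgt = original.split(), cand.text.split()
    dist = token_edit_distance(src, tgt)
    # Minimality is 1 for an unchanged input and 0 for a full rewrite.
    minimality = 1.0 - dist / max(len(src), len(tgt), 1)
    validity = 1.0 if cand.flipped else 0.0
    return alpha * validity + (1.0 - alpha) * minimality


def build_preference_pair(original, candidates, alpha=0.5):
    """Return (chosen, rejected): the highest- and lowest-scoring texts."""
    ranked = sorted(candidates, key=lambda c: composite_score(original, c, alpha))
    return ranked[-1].text, ranked[0].text


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss computed from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    original = "the movie was great"
    candidates = [
        Candidate("the movie was terrible", flipped=True),       # valid, minimal
        Candidate("the movie was great fun", flipped=False),     # minimal, invalid
        Candidate("i hated every single minute", flipped=True),  # valid, not minimal
    ]
    chosen, rejected = build_preference_pair(original, candidates)
    print("chosen:  ", chosen)
    print("rejected:", rejected)
```

Under these assumptions, a candidate that flips the prediction with a one-token edit outranks both an invalid near-copy and a valid full rewrite, which is the kind of preference signal the abstract describes. In practice the resulting pairs would be passed to a DPO trainer together with the prompt in each target language.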
