C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs
Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao

TL;DR
This paper introduces C2PO, a unified framework that identifies and suppresses bias shortcuts in large language models by leveraging causal signals, improving fairness and reducing stereotypes without sacrificing reasoning performance.
Contribution
The paper proposes C2PO (Causal-Contrastive Preference Optimization), an alignment method that addresses stereotypical and structural biases in LLMs together, using causal counterfactuals and dynamic suppression mechanisms.
Findings
C2PO effectively reduces stereotypical biases across multiple benchmarks.
C2PO maintains strong reasoning capabilities while mitigating biases.
Extensive experiments validate the framework's robustness and versatility.
Abstract
Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
