TL;DR
This paper introduces Epistemic Independence Training (EIT), a reinforcement learning method that reduces the influence of bias cues on large language models, improving reasoning robustness and generalization across bias types and benchmarks.
Contribution
EIT is a novel RL framework that makes bias cues non-predictive of reward, leading to less biased and more transferable reasoning in LLMs.
Findings
EIT improves accuracy and robustness against adversarial biases.
Models trained with EIT generalize to unseen bias types.
EIT outperforms existing methods like GroupDRO and IRM across multiple benchmarks.
Abstract
Large language models (LLMs) increasingly serve as reasoners and automated evaluators, yet they remain susceptible to cognitive biases, often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues attractive. We propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on…
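The two ingredients named in the abstract, balanced conflict and a reward that penalizes bias-following without rewarding bias agreement, can be illustrated with a minimal Python sketch. This is one possible reading of the abstract, not the paper's implementation: the `Example` class, the function names `add_balanced_cue` and `reward`, the consensus-style cue wording, and the reward and penalty values are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    prompt: str
    options: List[str]
    correct: int     # index of the correct option
    cue_target: int  # index of the option the bias cue points to


def add_balanced_cue(question: str, options: List[str], correct: int,
                     rng: random.Random) -> Example:
    """Prepend a consensus-style cue that supports the correct answer
    with probability 0.5 and a wrong answer otherwise (assumed scheme)."""
    if rng.random() < 0.5:
        target = correct
    else:
        wrong = [i for i in range(len(options)) if i != correct]
        target = rng.choice(wrong)
    cue = f"Most people think the answer is {options[target]}."
    return Example(f"{cue}\n{question}", options, correct, target)


def reward(example: Example, predicted: int, penalty: float = 0.5) -> float:
    """Hypothetical reward: correctness earns the same reward whether or
    not the cue agreed; a wrong answer that matches the cue is penalized."""
    if predicted == example.correct:
        return 1.0       # no bonus for agreeing with the cue
    if predicted == example.cue_target:
        return -penalty  # followed a misleading cue
    return 0.0           # wrong, but not because of the cue
```

Because the cue points to the correct option about half the time, a policy that simply follows the cue gains no expected advantage from it, which is the sense in which the cue is made non-predictive of reward in this sketch.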
