Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning
Sheryl Mathew, N Harshit

TL;DR
This paper introduces a counterfactual reward model that uses causal inference to mitigate bias in multimodal reinforcement learning, improving fairness and robustness in fake news detection.
Contribution
It presents a novel counterfactual trust score that decomposes biases and enhances bias resilience in reward models for multimodal RLHF.
Findings
Achieved 89.12% accuracy in fake news detection
Reduced spurious correlations and unfair signals
Demonstrated robustness against synthetic bias injections
Abstract
In reinforcement learning from human feedback (RLHF), reward models can efficiently learn and amplify latent biases within multimodal datasets, which can lead to imperfect policy optimization through flawed reward signals and decreased fairness. Bias mitigation studies have often applied passive constraints, which can fail under causal confounding. Here, we present a counterfactual reward model that combines causal inference with multimodal representation learning to provide an unsupervised, bias-resilient reward signal. The heart of our contribution is the Counterfactual Trust Score, an aggregated score consisting of four components: (1) counterfactual shifts that decompose political framing bias from topical bias; (2) reconstruction uncertainty during counterfactual perturbations; (3) demonstrable violations of fairness rules for each protected attribute; and (4) temporal reward…
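The abstract describes the Counterfactual Trust Score as an aggregate of component signals but, as excerpted here, does not state the aggregation rule. The sketch below is one possible reading, not the authors' method: it assumes each component is a penalty already normalized to [0, 1] and combines them with a weighted average. The function name, the component keys, and the weights are all hypothetical. The fourth component is truncated in the text above, so the example leaves it out rather than guess at it.

```python
from typing import Mapping


def counterfactual_trust_score(
    components: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Combine per-component penalties into one trust score in [0, 1].

    Assumption (not from the paper): each component is a penalty in [0, 1],
    where 0 means no detected bias, and the aggregate is a weighted average.
    """
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {name!r} must lie in [0, 1]")

    # Weighted mean penalty; higher penalty means lower trust.
    penalty = sum(weights[k] * components[k] for k in components) / total_weight
    return 1.0 - penalty


# Hypothetical values for the three components named in full in the abstract.
example_components = {
    "counterfactual_shift": 0.2,
    "reconstruction_uncertainty": 0.1,
    "fairness_violation": 0.0,
}
example_weights = {name: 1.0 for name in example_components}
example_score = counterfactual_trust_score(example_components, example_weights)
```

With equal weights the score is one minus the mean penalty. A real reward model would need the paper's actual component definitions and weighting.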
