Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning

Sheryl Mathew; N Harshit

arXiv:2508.19567·cs.LG·August 28, 2025

TL;DR

This paper introduces a counterfactual reward model with causal inference techniques to mitigate biases in multimodal reinforcement learning, improving fairness and robustness in fake news detection.

Contribution

The paper presents the Counterfactual Trust Score, an aggregated score that decomposes bias sources and makes reward models for multimodal RLHF more resilient to bias.

Findings

01. Achieved 89.12% accuracy in fake news detection

02. Reduced spurious correlations and unfair signals

03. Demonstrated robustness against synthetic bias injections

Abstract

In reinforcement learning with human feedback (RLHF), reward models can efficiently learn and amplify latent biases within multimodal datasets, which can lead to imperfect policy optimization through flawed reward signals and decreased fairness. Bias mitigation studies have often applied passive constraints, which can fail under causal confounding. Here, we present a counterfactual reward model that introduces causal inference with multimodal representation learning to provide an unsupervised, bias-resilient reward signal. The heart of our contribution is the Counterfactual Trust Score, an aggregated score consisting of four components: (1) counterfactual shifts that decompose political framing bias from topical bias; (2) reconstruction uncertainty during counterfactual perturbations; (3) demonstrable violations of fairness rules for each protected attribute; and (4) temporal reward…
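
As a rough illustration of what "an aggregated score consisting of four components" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how the components are scaled or combined, so the sketch assumes each component is a penalty in [0, 1], combines them with a weighted average, and returns one minus that penalty. The names `TrustComponents` and `trust_score` and the equal default weights are hypothetical, and the fourth component is left generic because the abstract is truncated at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustComponents:
    """Four per-sample penalty terms, each ASSUMED to be scaled to [0, 1]."""
    counterfactual_shift: float        # (1) shift under counterfactual edits
    reconstruction_uncertainty: float  # (2) uncertainty under perturbation
    fairness_violation: float          # (3) fairness-rule violations
    temporal_term: float               # (4) temporal reward term (truncated in the abstract)


def trust_score(c: TrustComponents,
                weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Return 1 minus the weighted mean penalty (hypothetical aggregation rule)."""
    terms = (c.counterfactual_shift, c.reconstruction_uncertainty,
             c.fairness_violation, c.temporal_term)
    if any(not 0.0 <= t <= 1.0 for t in terms):
        raise ValueError("each component must lie in [0, 1]")
    if len(weights) != 4 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be four values summing to 1")
    penalty = sum(w * t for w, t in zip(weights, terms))
    return 1.0 - penalty


if __name__ == "__main__":
    sample = TrustComponents(0.2, 0.4, 0.0, 0.2)
    print(round(trust_score(sample), 3))
```

Under this assumed rule, a sample with no detected bias signal scores 1 and a sample that saturates every penalty scores 0. The paper's actual aggregation may differ.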
