Fairness Aware Reward Optimization
Ching Lam Choi, Vighnesh Subramaniam, Phillip Isola, Antonio Torralba, Stefanie Jegelka

TL;DR
This paper introduces Faro, a novel in-processing framework for training reward models that incorporate fairness constraints, ensuring fairer large language model alignment without sacrificing performance.
Contribution
Faro is the first framework providing theoretical guarantees for reward-level fairness in LLM alignment, balancing fairness and accuracy through KL-regularized fine-tuning.
Findings
Faro achieves provable fairness certificates with controllable slack.
Faro effectively reduces bias and harmful outputs in LLMs.
Faro maintains or improves model quality while enforcing fairness.
Abstract
Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving that fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro…
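To make the idea of a fairness-constrained reward objective concrete, here is a minimal sketch in Python. It is not Faro's actual formulation, which this summary does not spell out. It assumes a standard Bradley-Terry pairwise preference loss and adds a soft demographic-parity penalty, defined here (as an illustrative assumption) as the gap in mean predicted preference probability between two groups. The function names, the binary group labels, and the weight `lam` are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood that the chosen response beats the rejected one."""
    total = 0.0
    for rc, rr in zip(r_chosen, r_rejected):
        d = rc - rr
        # -log(sigmoid(d)) = log(1 + exp(-d)), written in a stable form.
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(r_chosen)


def demographic_parity_gap(r_chosen, r_rejected, groups):
    """Assumed fairness measure: absolute gap in mean predicted
    preference probability between group 0 and group 1."""
    probs = {0: [], 1: []}
    for rc, rr, g in zip(r_chosen, r_rejected, groups):
        probs[g].append(sigmoid(rc - rr))
    mean0 = sum(probs[0]) / len(probs[0])
    mean1 = sum(probs[1]) / len(probs[1])
    return abs(mean0 - mean1)


def penalized_loss(r_chosen, r_rejected, groups, lam=1.0):
    # Soft-penalty relaxation of a constrained problem (an assumption,
    # not the paper's stated optimization method).
    fit = bradley_terry_loss(r_chosen, r_rejected)
    gap = demographic_parity_gap(r_chosen, r_rejected, groups)
    return fit + lam * gap


if __name__ == "__main__":
    # Toy reward scores: group 0 pairs are separated, group 1 pairs are tied.
    r_chosen = [2.0, 2.0, 0.0, 0.0]
    r_rejected = [0.0, 0.0, 0.0, 0.0]
    groups = [0, 0, 1, 1]
    print(penalized_loss(r_chosen, r_rejected, groups, lam=0.5))
```

Raising `lam` trades preference-fitting accuracy for a smaller group gap, which is one simple way to picture the accuracy-fairness trade-off the abstract refers to.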
Peer Reviews
Decision · Submitted to ICLR 2026
The paper addresses an important and timely problem in LLM alignment, ensuring fairness during the reward modeling phase. Incorporating algorithmic fairness constraints into this stage is an important direction given the growing societal impact of biased model behavior. The work attempts to provide theoretical guarantees for fairness compliance and analyzes the accuracy–fairness trade-off induced by RL fine-tuning. It also highlights an underexplored yet socially significant issue, namely that b
This paper reads poorly in terms of presentation. For instance, there are many issues with definitions and notations, which make the paper difficult to follow. The symbol $\mathcal{J}$ first appears in Equation (1) on page 3 (line 128), but it is only formally defined on page 4 (line 145). On page 4 (line 167), the definition of $q$ is too informal. The events $\mathcal{E}$ and $\mathcal{E}'$ seem to play an important role in the definition of the $q$ function, but they are rarely mentioned or
1. The paper focuses on improving fairness in reward optimization, an essential domain to explore given the increasing reliance on LLMs in high-stakes applications. 2. The proposed algorithm is applicable to various important group fairness metrics, including demographic parity (DP) and equality of opportunity (EO). 3. The overall design rests on a theoretical foundation.
My main concerns lie in the empirical verification of the proposed method, as the current experimental setup raises several questions regarding the robustness and generalizability of the findings. 1. The baseline data points in the LLM experiment are very limited. For example, there is no explicit baseline data provided for delta_dp, delta_eo, or delta_cf. It is very critical to observe the performance changes in these fairness metrics, especially given that the algorithm is specifically designe
Considering fairness in the setting of RLHF is well motivated and timely. The authors give a clear problem formulation and develop practical reformulations and optimization methods to solve the problem. The paper offers both theoretical and empirical insights, which together make a complete set of results. Overall, the work is also well structured.
- The exposition is sometimes too sketchy on notation and key definitions, which makes the paper difficult to follow for non-experts. For example, the fairness notions in Section 2.2 are introduced largely in abstract terms, without concrete explanation of the variables and notations involved. This level of abstraction may be fine for domain experts but does not help with the accessibility for a broader audience. - While the motivation is strong and the problem is formulated rigorously, the tec
Taxonomy
Topics: Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing · Explainable Artificial Intelligence (XAI)
