Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Muneeb Ur Raheem Khan

TL;DR
This paper introduces a decoding-time debiasing method for large language models using a Process Reward Model to evaluate and select less biased outputs without retraining, applicable to both controlled and open-ended generation.
Contribution
It proposes a novel decoding-time debiasing framework that leaves model weights untouched. The framework comprises three schemes, is evaluated across multiple models and bias categories, and improves fairness while maintaining fluency.
Findings
The Sequential critique-and-revise scheme significantly reduces bias scores.
The framework maintains or improves fluency in debiased outputs.
Debiasing overhead is modest, staying near 2x for well-calibrated models.
Abstract
Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights. A separate Process Reward Model (PRM) acts as a judge, scoring each candidate for both fairness and fluency. We design three schemes of increasing sophistication (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit) and evaluate them on four models (GPT-4o-mini, Llama 3.2…
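To make the simplest of the three schemes concrete, here is a minimal sketch of Best-of-N selection with a judge that scores each candidate for fairness and fluency. It is an illustration under stated assumptions, not the paper's implementation: the names `Score`, `best_of_n`, `toy_judge`, and `fairness_weight` are hypothetical, the weighted sum used to combine the two scores is an assumption (the abstract does not say how they are combined), and the word-list judge merely stands in for a real Process Reward Model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Score:
    fairness: float  # higher means less biased (assumed scale 0..1)
    fluency: float   # higher means more fluent (assumed scale 0..1)


def best_of_n(
    candidates: Sequence[str],
    judge: Callable[[str], Score],
    fairness_weight: float = 0.5,
) -> str:
    """Return the candidate with the highest weighted fairness/fluency score.

    The linear weighting is an assumption made for this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def combined(text: str) -> float:
        s = judge(text)
        return fairness_weight * s.fairness + (1.0 - fairness_weight) * s.fluency

    return max(candidates, key=combined)


def toy_judge(text: str) -> Score:
    # Hypothetical stand-in for a PRM: flags a hand-picked word list
    # and treats very short outputs as less fluent.
    flagged = {"bossy", "emotional"}
    words = text.lower().rstrip(".").split()
    fairness = 0.0 if any(w in flagged for w in words) else 1.0
    fluency = 1.0 if len(words) >= 3 else 0.5
    return Score(fairness, fluency)


# The base model's N sampled completions would go here; these are made up.
candidates = [
    "The manager was bossy.",
    "The manager was decisive.",
    "Decisive.",
]
chosen = best_of_n(candidates, toy_judge)
print(chosen)
```

Because the judge is passed in as a plain callable, the base model's weights are never touched; only its sampled outputs are re-ranked. Swapping `toy_judge` for a call to an actual reward model would not change the selection logic.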
