Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
Lijun Zhang, Lin Li, Wei Wei, Yajie Qi, Huizhong Song, Jun Wang, Yaodong Yang, Jiye Liang

TL;DR
This paper introduces Risk-aware Stepwise Alignment (RSA), a novel method for fine-tuning language models that explicitly incorporates risk measures to improve safety and robustness against harmful behaviors.
Contribution
RSA is a new alignment approach that integrates nested risk measures into token-level policy optimization, enhancing safety and reducing tail risks in language models.
Findings
RSA achieves high levels of both helpfulness and safety.
It significantly suppresses tail risks and unsafe responses.
Theoretical analysis supports policy optimality under mild assumptions.
Abstract
When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, typically operate under a risk-neutral paradigm that is insufficient to address the risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into the policy optimization process by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure that yields token-level policy…
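To illustrate why a risk-neutral objective can miss rare but severe harmful behavior, the following minimal sketch compares the mean safety cost with a tail-sensitive measure. It assumes Conditional Value-at-Risk (CVaR) over equally likely samples as the example risk measure. The abstract does not say which nested risk measures RSA uses, and the sketch applies the measure once rather than nesting it across tokens. The function names and cost values are hypothetical.

```python
def mean_cost(costs):
    """Risk-neutral summary: the average safety cost."""
    return sum(costs) / len(costs)


def cvar_cost(costs, tail_fraction):
    """Tail-sensitive summary: average of the worst `tail_fraction` of samples.

    Assumes equally likely samples; this is an empirical CVaR estimate used
    only for illustration, not the risk measure defined in the paper.
    """
    ordered = sorted(costs, reverse=True)  # largest (worst) costs first
    k = max(1, round(tail_fraction * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical safety costs for 100 sampled responses per policy.
policy_a = [0.1] * 100               # uniformly mild cost
policy_b = [0.0] * 98 + [5.0] * 2    # mostly harmless, rare severe failures

mean_a, mean_b = mean_cost(policy_a), mean_cost(policy_b)
cvar_a, cvar_b = cvar_cost(policy_a, 0.05), cvar_cost(policy_b, 0.05)

print(f"mean: A={mean_a:.3f}  B={mean_b:.3f}")
print(f"CVaR (worst 5%): A={cvar_a:.3f}  B={cvar_b:.3f}")
```

Both hypothetical policies have the same mean cost, so a risk-neutral criterion cannot tell them apart, while the tail measure assigns a much higher cost to the policy with rare severe failures.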
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education
