TL;DR
This paper introduces ShaPO, a geometry-aware optimization framework that enhances the robustness of large language model safety alignment by controlling optimization geometry, especially under distribution shifts and noisy data.
Contribution
ShaPO is a novel framework that enforces worst-case alignment objectives through selective geometry control, improving robustness over existing methods.
Findings
ShaPO consistently outperforms popular preference optimization methods across safety benchmarks.
ShaPO stabilizes likelihood-based optimization and enforces reward consistency under noisy supervision.
Combining ShaPO with data-robust objectives yields further robustness improvements.
Abstract
Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization,…
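To make "worst-case objective with selective geometry control" concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm, which the abstract excerpt does not specify. It uses a generic sharpness-aware-style update on a toy linear preference scorer, where the worst-case perturbation is restricted to a chosen subset of parameters. The names `pref_loss`, `pref_grad`, `selective_worst_case_step`, the binary `mask`, and the hyperparameters `rho` and `lr` are all hypothetical.

```python
import numpy as np

def pref_loss(w, x_chosen, x_rejected, beta=1.0):
    """Logistic (Bradley-Terry style) preference loss for a linear scorer w."""
    margin = beta * (x_chosen - x_rejected) @ w
    # log(1 + exp(-margin)), computed in a numerically stable way
    return float(np.mean(np.logaddexp(0.0, -margin)))

def pref_grad(w, x_chosen, x_rejected, beta=1.0):
    """Analytic gradient of pref_loss with respect to w."""
    diff = x_chosen - x_rejected
    margin = beta * diff @ w
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    coeff = -1.0 / (1.0 + np.exp(margin))
    return beta * (coeff[:, None] * diff).mean(axis=0)

def selective_worst_case_step(w, mask, x_chosen, x_rejected, rho=0.05, lr=0.1):
    """One update that descends the loss at a perturbed point.

    Assumption: the perturbation eps ascends the loss only along the
    coordinates where mask == 1 (a stand-in for an "alignment-critical"
    subspace); all other coordinates are left unperturbed.
    """
    g_sel = pref_grad(w, x_chosen, x_rejected) * mask
    norm = np.linalg.norm(g_sel)
    eps = rho * g_sel / norm if norm > 0 else np.zeros_like(w)
    # Gradient evaluated at the (approximate) worst-case neighbor w + eps
    g_adv = pref_grad(w + eps, x_chosen, x_rejected)
    return w - lr * g_adv, eps
```

Setting `mask` to all ones would recover a uniform perturbation over every parameter, which corresponds to the uniform geometry constraint the abstract contrasts against. How the actual method selects its subspace is not stated in the excerpt.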
Taxonomy
Topics: Constraint Satisfaction and Optimization · Adversarial Robustness in Machine Learning · Bayesian Modeling and Causal Inference
