TL;DR
This paper studies cross-objective interference in multi-objective alignment of large language models, where training improves some objectives while degrading others. It provides a theoretical framework, a systematic empirical analysis, and a mitigation method called Covariance Targeted Weight Adaptation (CTWA).
Contribution
It formalizes cross-objective interference, derives a covariance law explaining it, and introduces CTWA to mitigate interference in multi-objective LLM training.
Findings
Interference is pervasive across scalarization algorithms and strongly model-dependent.
An objective improves when its reward has positive covariance with the scalarized score.
CTWA effectively reduces cross-objective interference.
Abstract
We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between…
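To make the covariance law concrete, here is a minimal numerical sketch. It is not the paper's derivation or code. It assumes a toy setting that does not appear in the source: a categorical policy over four candidate responses, two hand-picked reward vectors, fixed scalarization weights, and an exponential-tilt (natural-gradient-style) update toward a higher scalarized score. Under that update, the first-order change in each objective's expected reward equals the step size times the covariance between that objective's reward and the scalarized score. All names (`tilt_step`, `covariance`, `check_covariance_law`) are hypothetical.

```python
import numpy as np


def covariance(pi, r, s):
    """Covariance of reward r and score s under the response distribution pi."""
    return float(np.dot(pi, (r - np.dot(pi, r)) * (s - np.dot(pi, s))))


def tilt_step(pi, s, eta):
    """Assumed toy update: reweight responses by exp(eta * s) and renormalize."""
    logits = np.log(pi) + eta * s
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()


def check_covariance_law(pi, rewards, weights, eta):
    """Return (covariance, actual change) for each objective after one step."""
    s = weights @ rewards  # scalarized score per response
    pi_new = tilt_step(pi, s, eta)
    results = []
    for r in rewards:
        delta = float(np.dot(pi_new, r) - np.dot(pi, r))
        results.append((covariance(pi, r, s), delta))
    return results


# Toy data (illustrative only): 4 responses, 2 objectives, uniform policy.
pi = np.full(4, 0.25)
rewards = np.array([
    [1.0, 0.0, 0.5, 0.0],  # objective 0
    [0.0, 1.0, 0.5, 0.0],  # objective 1
])
weights = np.array([0.8, 0.2])  # scalarization favors objective 0
eta = 1e-3

results = check_covariance_law(pi, rewards, weights, eta)
for k, (cov, delta) in enumerate(results):
    print(f"objective {k}: cov={cov:+.6f}  change={delta:+.3e}  eta*cov={eta * cov:+.3e}")
```

In this toy setup, objective 0 has positive covariance with the scalarized score and improves, while objective 1 has negative covariance and degrades even though its scalarization weight is positive. That is the interference pattern the abstract describes. Changing `weights` so that both covariances are positive removes the degradation in this example, which is the intuition behind adapting weights to keep covariances positive.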
