Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
Hefei Xu, Le Wu, Chen Cheng, Hao Liu

TL;DR
This paper introduces a novel multi-value alignment framework for large language models that minimizes value interference and uses extrapolation to explore diverse value trade-offs, improving alignment with human values.
Contribution
The paper proposes a new multi-value alignment method that reduces parameter interference and employs value extrapolation to better balance conflicting human values.
Findings
Outperforms existing methods in multi-value alignment tasks
Effectively handles conflicting human values
Constructs diverse LLMs with optimized value trade-offs
Abstract
With the rapid advancement of large language models (LLMs), aligning them with human values for safety and ethics has become a critical challenge. This problem is especially challenging when multiple, potentially conflicting human values must be considered and balanced. Although several variants of existing alignment methods (such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)) have been proposed to address multi-value alignment, they suffer from notable limitations: 1) they are often unstable and inefficient in multi-value optimization; and 2) they fail to effectively handle value conflicts. As a result, these approaches typically struggle to achieve optimal trade-offs when aligning multiple values. To address this challenge, we propose a novel framework called Multi-Value Alignment (MVA). It mitigates alignment degradation caused by…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing
