Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie

TL;DR
This paper introduces COUPLE, a counterfactual reasoning framework using causal models to improve the alignment of large language models with complex, pluralistic human values, enabling nuanced control and better interpretability.
Contribution
It proposes a novel causal modeling and counterfactual reasoning approach for aligning LLMs with multiple, interdependent human values, addressing value complexity and steerability.
Findings
COUPLE outperforms baselines on multiple value objectives
It improves interpretability of value alignment
It demonstrates effective control over nuanced value priorities
Abstract
As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz's Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic…
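To make the notion of a value objective concrete, the following is a minimal sketch of "multiple value dimensions paired with various priorities" as a data structure. It is not the COUPLE method. The class name `ValueObjective`, the 0-to-1 priority scale, the `distance` function, and the specific numbers are all hypothetical choices made for illustration; only the dimension names are borrowed from Schwartz's Value Theory.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ValueObjective:
    """Hypothetical container: value dimension name -> target priority in [0, 1]."""
    priorities: Dict[str, float]

    def distance(self, observed: Dict[str, float]) -> float:
        """Mean absolute gap between target and observed priorities.

        Dimensions missing from `observed` are treated as priority 0.0
        (an assumption of this sketch).
        """
        gaps = [abs(p - observed.get(dim, 0.0)) for dim, p in self.priorities.items()]
        return sum(gaps) / len(gaps)


# A target profile with unequal priorities (illustrative numbers only).
target = ValueObjective({"benevolence": 0.9, "achievement": 0.3, "security": 0.6})

# Treating all values as equally important corresponds to a flat profile.
uniform = {dim: 0.5 for dim in target.priorities}

print(round(target.distance(uniform), 3))
```

Under these assumptions, a flat profile sits at a nonzero distance from any target with unequal priorities, which is one simple way to picture why treating values as equally important misses fine-grained objectives.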
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI)
