Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective

Yipeng Kang; Junqi Wang; Yexin Li; Mengmeng Wang; Wenming Tu; Quansen Wang; Hengli Li; Tingjun Wu; Xue Feng; Fangwei Zhong; Zilong Zheng

arXiv:2501.00581·cs.CL·February 25, 2025

TL;DR

This paper explores the underlying causal structure of LLMs' values, revealing differences from human values, and introduces lightweight, effective methods for more precise value alignment and steering.

Contribution

It proposes the concept of a causal value graph for LLMs and develops two novel, resource-efficient value-steering techniques based on this framework.

Findings

01 Causal value graphs differ significantly from human value systems.

02 Role-based prompting and SAE steering improve value alignment.

03 Experiments show enhanced control and effectiveness in LLMs.

Abstract

As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained…
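To make the idea of SAE steering concrete, here is a minimal sketch of how steering through a sparse autoencoder is commonly done in general. It is not the authors' implementation: the function names (`sae_encode`, `sae_decode`, `steer`), the dimensions, and the randomly initialised weights are all hypothetical, and a real setup would use an SAE trained on the activations of an actual LLM layer.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU encoder: maps a model activation to non-negative feature activations.
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    # Linear decoder: maps feature activations back to activation space.
    return f @ W_dec + b_dec

def steer(x, feature_idx, target, W_enc, b_enc, W_dec, b_dec):
    """Set one SAE feature to `target` and apply the resulting change to x."""
    f = sae_encode(x, W_enc, b_enc)
    f_steered = f.copy()
    f_steered[feature_idx] = target
    # Add only the decoded difference, so whatever the SAE fails to
    # reconstruct in x is left untouched.
    delta = sae_decode(f_steered, W_dec, b_dec) - sae_decode(f, W_dec, b_dec)
    return x + delta

# Toy setup (assumption: random weights stand in for a trained SAE).
rng = np.random.default_rng(0)
d_model, d_sae = 4, 8
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)          # stand-in for one hidden-state vector
features = sae_encode(x, W_enc, b_enc)
steered = steer(x, 3, 5.0, W_enc, b_enc, W_dec, b_dec)
```

In this sketch the edit reduces to shifting `x` along the decoder direction of the chosen feature, scaled by how far the feature is moved from its current value. Which feature corresponds to which value dimension is something the sketch does not address.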

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management

Methods: Sparse Evolutionary Training · Focus · Sparse Autoencoder