Understanding the Mechanism of Altruism in Large Language Models
Shuhuai Zhang, Shu Wang, Zijun Yao, Chuanhao Li, Xiaozhi Wang, Songfa Zhong, and Tracy Xiao Liu

TL;DR
This paper investigates the internal mechanisms of altruism in large language models using sparse autoencoders and causal interventions, revealing identifiable features associated with prosocial behavior.
Contribution
The paper introduces a framework that combines sparse autoencoders with benchmark tasks to interpret and manipulate altruistic behavior in LLMs.
Findings
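A small set of SAE features (0.024% of all features across the model's layers) is strongly associated with altruistic behavior.
Causal interventions on these features can reliably shift the model's social preferences.
A subset of the features, classified as primarily heuristic (System 1) or primarily deliberative (System 2), influences LLM altruism.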
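To make the idea of a causal intervention concrete, here is a minimal sketch of one common way to steer a model with an SAE feature: adding a scaled copy of the feature's decoder direction to a hidden activation. This page does not describe the paper's actual intervention procedure, so the function name `steer`, the scaling parameter `alpha`, and the toy vectors are all assumptions made for illustration.

```python
def steer(residual, decoder_direction, alpha):
    """Return residual + alpha * decoder_direction (hypothetical steering rule).

    residual:          hidden activation at one token position (list of floats)
    decoder_direction: the SAE decoder vector of the feature being intervened on
    alpha:             signed strength; positive amplifies, negative suppresses
    """
    if len(residual) != len(decoder_direction):
        raise ValueError("residual and decoder_direction must have equal length")
    return [r + alpha * d for r, d in zip(residual, decoder_direction)]


# Toy 3-dimensional example; real activations have thousands of dimensions.
residual = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

steered_up = steer(residual, direction, alpha=2.0)     # amplify the feature
steered_down = steer(residual, direction, alpha=-2.0)  # suppress the feature
```

In a real experiment the steered activation would be written back into the forward pass, and the resulting Dictator Game allocation compared against the unsteered baseline.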
Abstract
Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal…
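The contrast-based feature selection described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' method: it assumes SAE activations have already been collected for the generous and selfish prompts, and it ranks features by the absolute difference in mean activation. The function name `rank_features_by_contrast`, the ranking criterion, and the toy data are all hypothetical.

```python
from statistics import mean


def rank_features_by_contrast(generous_acts, selfish_acts, top_k):
    """Rank SAE features by |mean(generous) - mean(selfish)| activation.

    generous_acts, selfish_acts: lists of per-prompt activation vectors,
    one float per SAE feature. Returns the top_k (feature_index, difference)
    pairs, largest absolute difference first.
    """
    n_features = len(generous_acts[0])
    diffs = []
    for j in range(n_features):
        g = mean(row[j] for row in generous_acts)
        s = mean(row[j] for row in selfish_acts)
        diffs.append((j, g - s))
    # Sort by magnitude so both generosity- and selfishness-linked features surface.
    diffs.sort(key=lambda item: abs(item[1]), reverse=True)
    return diffs[:top_k]


# Toy data: 2 prompts per condition, 4 SAE features (assumed values).
generous = [[0.0, 2.0, 0.1, 0.0], [0.0, 1.8, 0.0, 0.1]]
selfish = [[0.0, 0.1, 0.1, 1.5], [0.0, 0.3, 0.0, 1.7]]

top = rank_features_by_contrast(generous, selfish, top_k=2)
for index, diff in top:
    print(f"feature {index}: mean difference {diff:+.2f}")
```

In this toy data, feature 1 fires mostly under the generous prompts and feature 3 mostly under the selfish ones, so both rank highest while the sign of the difference tells them apart.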
