The Compositional Architecture of Regret in Large Language Models
Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang

TL;DR
This paper investigates how large language models express and internally represent regret. It introduces new datasets and metrics for analyzing hidden states, and it reveals layer-level and neuron-level mechanisms of regret handling.
Contribution
The paper presents a novel workflow, metrics, and analysis methods for identifying and understanding regret expressions and regret neurons in large language models.
Findings
Identified the optimal regret representation layer using the S-CDI metric.
Discovered an M-shaped decoupling pattern across model layers.
Categorized neurons into regret, non-regret, and dual groups.
Abstract
Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a…
Taxonomy
Topics: Natural Language Processing Techniques
