Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Jian-Qiao Zhu; Haijiang Yan; Thomas L. Griffiths

arXiv:2505.11615·cs.CL·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a method for identifying steering vectors by aligning behaviorally elicited representations with their neural counterparts, enabling targeted changes to large language models' behavior without retraining.

Contribution

It presents a systematic approach to find steering vectors by aligning behavioral and neural representations, specifically for modulating risk preferences in LLMs.

Findings

01. Successfully modulated LLM risk-related outputs

02. Aligned neural and behavioral representations effectively

03. Steering vectors reliably influence model behavior

Abstract

Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully…
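As a rough illustration of what "editing the residual stream" with a steering vector means, the sketch below adds a scaled direction to a toy hidden state. The names `steer`, `alpha`, `hidden`, and `risk_direction` and all numeric values are assumptions made for illustration; the source does not specify the layer, scale, or dimensionality used in the paper.

```python
from typing import List


def steer(residual: List[float], steering_vector: List[float], alpha: float) -> List[float]:
    """Return residual + alpha * steering_vector, computed element-wise."""
    if len(residual) != len(steering_vector):
        raise ValueError("residual and steering_vector must have the same length")
    return [h + alpha * v for h, v in zip(residual, steering_vector)]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product, used here to read out the projection onto a direction."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional "residual stream" activation (made-up numbers).
hidden = [0.5, -1.0, 2.0]
# Hypothetical direction assumed to encode risk preference.
risk_direction = [1.0, 0.0, 0.0]

steered = steer(hidden, risk_direction, alpha=2.0)

# The projection onto the direction grows by alpha * ||direction||^2.
before = dot(hidden, risk_direction)
after = dot(steered, risk_direction)
```

In a real model the same addition would typically be applied to a chosen layer's activations during the forward pass; a negative `alpha` pushes the state in the opposite direction.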

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The approach is novel, and a persuasive case is made that the MCMC method is the right way to capture the structure of risk preferences in an LLM.

Weaknesses

The much simpler Contrastive Activation method is not offered a fair comparison. The paper's contrast of "risk" and "safety" related words would have induced a vector related to the abstract concept of risk, but the behavioral methods identify vectors related to quantitative risk preferences. Thus when steering on risk preference-related prompts (Figure 3), the former is ineffective. A more appropriate comparison would be to a vector formed by contrasting risky with safe choices. The paper's con
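For readers unfamiliar with the contrastive baseline under discussion, a common way to build such a vector is to take the difference between mean activations on two contrasting sets of inputs. The sketch below shows that construction on made-up numbers; `risky_acts` and `safe_acts` are hypothetical activations, and this is not claimed to be the exact baseline implemented in the paper.

```python
from typing import List

Matrix = List[List[float]]


def mean_vector(rows: Matrix) -> List[float]:
    """Column-wise mean of a list of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_vector(positive: Matrix, negative: Matrix) -> List[float]:
    """Difference of means: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Hypothetical activations recorded on "risky" and "safe" inputs (made-up numbers).
risky_acts = [[1.0, 0.0], [3.0, 2.0]]
safe_acts = [[0.0, 1.0], [0.0, 3.0]]

vec = contrastive_vector(risky_acts, safe_acts)
```

What the resulting vector encodes depends entirely on what the two input sets contrast, which is the point the reviewer raises: contrasting risk-related words and contrasting risky versus safe choices would generally yield different directions.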

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper proposes a new self-alignment method that uniquely combines the probability triangle, MCMC method, and lasso regression to manipulate risk preferences in large language models.
2. The experimental evaluation is extensive, covering three different but related tasks: risky decision-making, risk perception, and text generation.
3. The method uses the model itself to identify steering vectors associated with risk preferences, showing great potential for practical applications.
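To make the combination of a probability triangle and MCMC more concrete, here is a minimal sketch of a choice-based chain over three-outcome gambles, where a chooser's binary decision between the current state and a proposal plays the role of the acceptance step. Everything named here is hypothetical: `stub_chooser` and its `appeal` function stand in for a query to the LLM, and the proposal rule and step size are arbitrary choices, not the paper's settings.

```python
import random
from typing import Callable, List, Tuple

# A point on the probability triangle: probabilities of three outcomes, summing to 1.
Gamble = Tuple[float, float, float]


def propose(state: Gamble, rng: random.Random, step: float = 0.2) -> Gamble:
    """Symmetric proposal: shift a random amount of mass from one outcome to another.

    If the shift would make a probability negative, the current state is returned,
    so the chain never leaves the triangle.
    """
    i, j = rng.sample(range(3), 2)
    delta = rng.uniform(0.0, step)
    new = list(state)
    new[i] -= delta
    new[j] += delta
    if new[i] < 0.0:
        return state
    return (new[0], new[1], new[2])


def stub_chooser(current: Gamble, proposal: Gamble, rng: random.Random) -> Gamble:
    """Hypothetical stand-in for an LLM choosing between two gambles.

    It picks each option with probability proportional to a made-up 'appeal'
    score (a ratio rule), which favours mass on the middle outcome.
    """
    def appeal(g: Gamble) -> float:
        return 0.05 + g[1]

    p_accept = appeal(proposal) / (appeal(current) + appeal(proposal))
    return proposal if rng.random() < p_accept else current


def run_chain(start: Gamble, n_steps: int,
              chooser: Callable[[Gamble, Gamble, random.Random], Gamble],
              seed: int = 0) -> List[Gamble]:
    """Run the chain: at every step the chooser keeps the current state or the proposal."""
    rng = random.Random(seed)
    state = start
    samples = []
    for _ in range(n_steps):
        state = chooser(state, propose(state, rng), rng)
        samples.append(state)
    return samples


samples = run_chain((1 / 3, 1 / 3, 1 / 3), 500, stub_chooser)
```

With a symmetric proposal and a ratio-rule chooser, the visited gambles concentrate where the chooser's implicit preference is high, which is what lets the samples be read as a behavioral representation of that preference.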

Weaknesses

1. The experiments are conducted only on a limited set of Gemma models. It is unclear whether the same performance improvement can be achieved on other large-scale models.
2. Key details are missing. For the construction of the steering vector, the paper does not explain how the Lasso regularization coefficient, the number of MCMC sampling steps, or the injection layer were chosen.
3. High initialization cost. When switching to another model or modifying other attributes, the steering process must start fro

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. **Novelty:** Adapting the MCMC procedure from Noguchi et al. (2013) by replacing people with an LLM is an interesting touch, as is its application to LLM steering.
2. **Originality:** Estimating the steering vector from the model's own preferences, without external datasets of opposing prompts, is not very common (to my knowledge).
3. **Significance:** The techniques (in Steps 1 and 2), even though individually not brand new, adopted for LLMs can be a useful contribution to the ICLR community workin

Weaknesses

1. **Writing:** Although the paper is well-written and presents an easy-to-follow narrative, Section 3 reads with some friction, as most mathematical objects are described verbally rather than symbolically. Explicitly casting the output samples from Step 1 into mathematical variables, passing them into Step 2, and formally expressing the lasso regression problem would lower the cognitive load required from readers to understand the method.
2. **Generality:** The prompt set construction method (r
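As an indication of the kind of symbolic statement the reviewer asks for, a generic lasso regression problem can be written as below. All symbols are assumptions introduced here for illustration: $y_i$ denotes a behaviorally elicited value for prompt $i$, $\mathbf{h}_i$ the corresponding neural activation, $n$ the number of prompts, and $\lambda$ the regularization coefficient. The source does not state whether the paper uses exactly this form.

```latex
(\hat{\mathbf{w}}, \hat{b})
  = \arg\min_{\mathbf{w},\, b}\;
    \frac{1}{2n} \sum_{i=1}^{n}
      \left( y_i - \mathbf{h}_i^{\top} \mathbf{w} - b \right)^{2}
    + \lambda \, \lVert \mathbf{w} \rVert_{1}
```

Under this reading, the sparse weight vector $\hat{\mathbf{w}}$ could serve as a steering direction, with a steered activation of the form $\mathbf{h}' = \mathbf{h} + \alpha \hat{\mathbf{w}}$ for some scale $\alpha$, which is likewise an assumed notation.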

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Focus