How Does Controllability Emerge In Language Models During Pretraining?

Jianshu She; Xinyue Li; Eric Xing; Zhengzhong Liu; Qirong Ho

arXiv:2508.01892·cs.LG·August 5, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how the ability to steer language models via linear transformations of hidden states develops during training, revealing that concept steerability emerges progressively and can be interpreted through a unified framework called the Intervention Detector.

Contribution

The paper introduces the Intervention Detector framework to analyze the emergence of linear steerability during training across different models, providing new insights into the dynamics of concept separability.

Findings

01. Linear steerability emerges during intermediate stages of training.

02. Even closely related concepts (e.g., anger and sadness) become steerable at distinct stages of training.

03. Concepts become more linearly separable as training progresses.

Abstract

Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of…
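As a rough illustration of what "linear steerability" refers to, the sketch below builds a concept direction from the difference of mean hidden states and shifts a hidden state along it. This is a common steering recipe, not necessarily the procedure used in the paper. The toy Gaussian vectors stand in for real hidden states, and every name here (`concept_direction`, `steer`, `separation_accuracy`, `toy_states`) is hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def normalized(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def concept_direction(pos_states, neg_states):
    """Unit vector from the mean 'negative' state to the mean 'positive' state."""
    mu_pos = mean_vector(pos_states)
    mu_neg = mean_vector(neg_states)
    return normalized([p - n for p, n in zip(mu_pos, mu_neg)])


def steer(hidden, direction, alpha):
    """Linear intervention: shift a hidden state by alpha along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def separation_accuracy(pos_states, neg_states, direction):
    """Fraction of states classified correctly by a midpoint threshold on the
    projection: a crude proxy for linear separability of the concept."""
    pos_proj = [projection(h, direction) for h in pos_states]
    neg_proj = [projection(h, direction) for h in neg_states]
    threshold = (sum(pos_proj) / len(pos_proj) + sum(neg_proj) / len(neg_proj)) / 2
    correct = sum(p > threshold for p in pos_proj)
    correct += sum(n <= threshold for n in neg_proj)
    return correct / (len(pos_proj) + len(neg_proj))


def toy_states(center, n, rng, noise=0.5):
    """Assumption: Gaussian clouds stand in for hidden states at one checkpoint."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
pos = toy_states([1.0, 0.0, 0.0, 0.0], 50, rng)   # e.g. prompts expressing a concept
neg = toy_states([-1.0, 0.0, 0.0, 0.0], 50, rng)  # e.g. prompts without it
direction = concept_direction(pos, neg)
steered = steer(neg[0], direction, alpha=4.0)
print("separation accuracy:", separation_accuracy(pos, neg, direction))
print("projection shift:", projection(steered, direction) - projection(neg[0], direction))
```

Under these assumptions, repeating the same measurement on hidden states taken from successive training checkpoints would give one simple way to track how separability changes over training.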

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- This paper works on a very interesting topic that is worth investigating, i.e., when controllability emerges during pre-training. It is not only valuable to people working on tasks like knowledge editing, but also helps us understand the learning process of LLMs.
- One interesting finding that I like is how control over different (emotional) concepts emerges differently for each one. It may be worth checking whether this is consistent with human beings.

Weaknesses

For a paper that investigates "controllability" (as its title suggests), and although it has many interesting findings, I expected it to consider more control techniques and factors. The current paper focuses on a very specific type of control, i.e., intervention, and on a very specific kind of factor, i.e., concepts related to emotions. Another major risk of the paper is its writing, which makes the paper hard to follow. I say this because of the following concerns:
- The p

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The idea of detecting when high-level concepts, such as a model’s understanding of emotions, first become controllable during training (by testing when interventions show an effect) is exciting and promising. The article is rich with graphics that present the results in an intuitive way. The results are interesting and motivate future work.

Weaknesses

Unfortunately, the article suffers from technical inaccuracies that make it very difficult to trace what exactly the authors did in their experiments. Section 3: Firstly, there are no definitions for $h_+, h_-$, the normalized() function, $S_{test}, A_i$, and checkpoints. The Appendix furthermore lists $H^+$ and $H^-$, which were never defined. Could the authors include a clear (sub)section for definitions, perhaps at the beginning or within an expanded notation subsection, to define all key

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The idea to explore the emergence of linear steerability over pre-training is quite novel; past works only focus on the final configuration of the LM (but this is for good reason, see weaknesses). Experiments test a broad range of concepts.

Weaknesses

I have several concerns about the paper, summarized broadly as follows:

**Major weaknesses (impacted score)**

1. __Motivation unclear/unconvincing:__ it was unclear why when controllability emerges should matter. In particular,
   - While the authors state that past work focuses on steerability of already-trained language models, this is the actual use case of LM steering. In contrast, the authors would need to make a strong case for investigating the emergence of controllability-- to me, it

Taxonomy

Topics: Mental Health via Writing · Topic Modeling · Sentiment Analysis and Opinion Mining