How Does Controllability Emerge In Language Models During Pretraining?
Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho

TL;DR
This paper investigates how the ability to steer language models via linear transformations of hidden states develops during training, revealing that concept steerability emerges progressively and can be interpreted through a unified framework called the Intervention Detector.
Contribution
The paper introduces the Intervention Detector framework to analyze the emergence of linear steerability during training across different models, providing new insights into the dynamics of concept separability.
Findings
Linear steerability emerges during intermediate training stages.
Related concepts show steerability at different training points.
Concepts become more linearly separable as training progresses.
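One way to make "more linearly separable as training progresses" concrete is sketched below. It scores separability as the held-out accuracy of a difference-of-means linear probe, applied to synthetic hidden states that stand in for successive checkpoints. This is an illustrative assumption rather than the paper's own metric: the function name `separability`, the synthetic `gaps` schedule, and the dimensions are all hypothetical.

```python
import numpy as np

def separability(h_pos, h_neg):
    """Held-out accuracy of a difference-of-means linear probe (illustrative metric)."""
    n = len(h_pos) // 2
    # Fit the probe on the first half of each class.
    mean_pos, mean_neg = h_pos[:n].mean(axis=0), h_neg[:n].mean(axis=0)
    direction = mean_pos - mean_neg
    threshold = (0.5 * (mean_pos + mean_neg)) @ direction
    # Evaluate on the second half of each class.
    pos_correct = (h_pos[n:] @ direction > threshold).mean()
    neg_correct = (h_neg[n:] @ direction <= threshold).mean()
    return 0.5 * (pos_correct + neg_correct)

rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0

# Hypothetical "checkpoints": the two concept classes drift apart over training.
gaps = [0.0, 0.5, 1.0, 3.0]
scores = []
for gap in gaps:
    h_pos = rng.normal(size=(200, dim)) + gap * axis
    h_neg = rng.normal(size=(200, dim)) - gap * axis
    scores.append(separability(h_pos, h_neg))

for gap, score in zip(gaps, scores):
    print(f"gap={gap:.1f}  probe accuracy={score:.2f}")
```

With real checkpoints, `h_pos` and `h_neg` would be hidden states collected from prompts that do and do not express the concept. An accuracy near 0.5 indicates no linear separation, and an accuracy near 1.0 indicates a clean one.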
Abstract
Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of…
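As a rough illustration of what "linear transformations of hidden states" can look like, the sketch below builds a steering direction as the normalized difference between the mean hidden states of positive and negative examples, then shifts a hidden state along it. This is a common difference-of-means construction and not necessarily the procedure used in the paper. The names `steering_direction` and `steer`, the scale `alpha`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def steering_direction(h_pos, h_neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(h, direction, alpha):
    """Shift hidden state(s) along the steering direction by alpha (assumed scale)."""
    return h + alpha * direction

rng = np.random.default_rng(0)
dim = 16
concept_axis = np.zeros(dim)
concept_axis[0] = 1.0

# Synthetic stand-ins for hidden states with and without a concept.
h_pos = rng.normal(size=(200, dim)) + 2.0 * concept_axis
h_neg = rng.normal(size=(200, dim)) - 2.0 * concept_axis

v = steering_direction(h_pos, h_neg)
h = rng.normal(size=dim)
h_steered = steer(h, v, alpha=4.0)

# The intervention moves the state by exactly alpha along v.
projection_shift = (h_steered - h) @ v
print(f"shift along steering direction: {projection_shift:.2f}")
```

In a real model, the shift would be applied to a chosen layer's activations during generation. Whether the output actually changes is the empirical question that steerability measures.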
Peer Reviews
Decision · Submitted to ICLR 2025
- I find that this paper works on a very interesting topic that is worth investigating, i.e., when controllability emerges during pre-training. It is valuable not only to people working on tasks like knowledge editing; it also helps us understand the learning procedures of LLMs.
- One interesting finding that I like is how the control of different (emotional) concepts emerges differently for each concept. It may be worth checking whether this is consistent with human beings.
As a paper that investigates "controllability" (as its title suggests), it has many interesting findings, but I expected it to consider more control techniques and factors. The current paper focuses on a very specific type of control, i.e., intervention, and on a very specific kind of factor, i.e., concepts related to emotions. Another major risk of the paper is its writing, which makes the paper hard to follow. I am saying this with the following concerns: - The p
The idea of detecting when high-level concepts, such as a model’s understanding of emotions, can be controlled for the first time during training (by testing when interventions show an effect) is exciting and promising. The article is rich with graphics that present the results in an intuitive way. The latter are interesting and motivate future work.
Unfortunately, the article suffers from technical inaccuracies that make it very difficult to trace what exactly the authors did in their experiments. Section 3: Firstly, there are no definitions for $h_+, h_-$, the normalized() function, $S_{test}, A_i$, and checkpoints. The Appendix furthermore lists $H^+$ and $H^-$, which were never defined. Could the authors include a clear (sub)section for definitions, perhaps at the beginning or within an expanded notation subsection, to define all key
The idea to explore the emergence of linear steerability over pre-training is quite novel; past works only focus on the final configuration of the LM (but this is for good reason, see weaknesses). Experiments test a broad range of concepts.
I have several concerns about the paper, summarized broadly as follows: **Major weaknesses (impacted score)** 1. __Motivation unclear/unconvincing:__ it was unclear why the point at which controllability emerges should matter. In particular, - While the authors state that past work focuses on steerability of already-trained language models, this is the actual use case of LM steering. In contrast, the authors would need to make a strong case for investigating the emergence of controllability-- to me, it
Taxonomy
Topics: Mental Health via Writing · Topic Modeling · Sentiment Analysis and Opinion Mining
