Think before you speak: Training Language Models With Pause Tokens
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

TL;DR
This paper introduces learnable pause tokens, appended to the input prefix, that let a language model perform extra computation before it commits to an output; the model's outputs are extracted only after the last pause token.
Contribution
The paper proposes a pause-token mechanism that lets a model manipulate extra hidden vectors before producing its output, applying the resulting delay during both training and inference.
Findings
Significant improvements on question-answering and reasoning tasks.
Effective delay-based training enhances model accuracy.
Demonstrated gains on multiple downstream tasks.
Abstract
Language models generate responses by producing a series of tokens in immediate succession: the (K+1)-th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate, say, K+10 hidden vectors before it outputs the (K+1)-th token? We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and…
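The inference-time side of the idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `PAUSE_ID`, `append_pauses`, `generate_with_pauses`, and the `step_fn` stand-in for a model are all hypothetical names, and the training procedure is not shown. The sketch only demonstrates the two steps the abstract describes: appending a sequence of pause tokens to the prefix, and reading the model's output only after the last pause token.

```python
from typing import Callable, List

# Hypothetical id reserved for the learnable pause token (an assumption for
# this sketch; a real vocabulary would assign its own id).
PAUSE_ID = -1


def append_pauses(prefix: List[int], num_pauses: int, pause_id: int = PAUSE_ID) -> List[int]:
    """Return a copy of the prefix followed by num_pauses pause tokens."""
    return list(prefix) + [pause_id] * num_pauses


def generate_with_pauses(
    step_fn: Callable[[List[int]], int],
    prefix: List[int],
    num_pauses: int,
    max_new_tokens: int,
    pause_id: int = PAUSE_ID,
) -> List[int]:
    """Decode an answer, reading outputs only after the last pause token.

    step_fn stands in for a model: it maps the full input sequence to the
    next token id, i.e. the prediction at the final position.
    """
    sequence = append_pauses(prefix, num_pauses, pause_id)
    answer: List[int] = []
    for _ in range(max_new_tokens):
        # step_fn is first called once all pause tokens are in place, so any
        # prediction the model would make at a pause position is never read.
        next_id = step_fn(sequence)
        sequence.append(next_id)
        answer.append(next_id)
    return answer


def toy_step(sequence: List[int]) -> int:
    """Toy stand-in model: returns the number of positions it was given."""
    return len(sequence)


# With a 3-token prefix and 2 pauses, the first output is computed over
# 5 positions instead of 3.
delayed = generate_with_pauses(toy_step, [5, 6, 7], num_pauses=2, max_new_tokens=2)
standard = generate_with_pauses(toy_step, [5, 6, 7], num_pauses=0, max_new_tokens=2)
```

Here the toy model simply reports how many positions it was given, which makes the effect of the delay visible: the delayed call sees two more positions than the standard call before producing its first answer token.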
Peer Reviews
Decision: ICLR 2024 poster
The paper successfully gets pause tokens working, which were previously thought not to work. The writing and presentation are clear, as are the experiments and results.
The paper is convincing and demonstrates some gains can be found via pause-token training. One weakness is that the method does not compare directly to Chain of Thought (CoT). In contrast to pause tokens, which require both pretraining and fine-tuning, CoT is an inference-time only method that potentially requires extra human annotations. Another difference is that CoT has the ability to perform variable-length computation, as opposed to the fixed number of pause tokens added at inference time.
1. This paper offers an intriguing exploration into the behavior of language models by increasing computation at the level of hidden states, rather than at the text level, like Chain-of-Thought. It presents the concept of the language model "thinking" through the generation of intermediate hidden states. 2. The introduction of a "pause token" is a novel approach that enables this deeper computational process to enhance the model’s performance on various tasks. There are a few useful observation
1. Regardless of the empirical gains, we need more theoretical insights into why and how "pause tokens" work during pre-training and fine-tuning. There is not enough motivation behind this trick. We need to understand why we need to "delay a model's answer generation." There are a few intuitions, but are not well-articulated and convincing enough. The reason to answer this question is necessary because the community can benefit if the pause tokens are so important to replace normal autoregressiv
1. The paper presents a comprehensive experimental analysis of a (relatively simple) phenomenon. The authors consider multiple downstream tasks and multiple model sizes, and try several baselines. Overall, I believe that the quality of experiments is one of the stronger sides of this paper. 2. The paper focuses on a very simple algorithm, meaning that it can be easily implemented in most frameworks and libraries. 3. The paper is generally well written, both conceptual explanations and experiments w
My main problem with the paper is that it ignores several very similar works from the early years of Transformer. Probably the closest idea to pause tokens is adaptive computation time (ACT) [1]. ACT was originally proposed for recurrent neural networks with the exact same motivation: to let the model "think" before generating a difficult token. The idea of ACT translates easily to Transformers, and many subsequent works in the last 7 years (e.g. [2,3]) use ACT for various transformer types, in
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
