Reconsidering the Past: Optimizing Hidden States in Language Models
Davis Yoshida, Kevin Gimpel

TL;DR
This paper introduces Hidden-State Optimization (HSO), a gradient-based method that improves transformer language models at inference time by updating cached hidden states rather than model parameters, yielding better perplexity and few-shot evaluation results with no extra parameters or training data.
Contribution
The paper proposes HSO, an inference-time optimization technique that updates the cached hidden states of transformer language models, improving perplexity on WikiText103 and PG-19 as well as results in prompt-based few-shot evaluation.
Findings
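HSO improves perplexity on the WikiText103 and PG-19 datasets for pretrained Transformer-XL and GPT-2 models.
HSO improves results in prompt-based few-shot evaluation, with no extra parameters or training data.
The perplexity gains are most pronounced when a model is evaluated outside its training distribution.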
Abstract
We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models at inference time. Similar to dynamic evaluation (Krause et al., 2018), HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters. We test HSO with pretrained Transformer-XL and GPT-2 language models, finding improvement on the WikiText103 and PG-19 datasets in terms of perplexity, especially when evaluating a model outside of its training distribution. We also demonstrate downstream applicability by showing gains in the recently developed prompt-based few-shot evaluation setting, again with no extra parameters or training data.
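The core idea in the abstract, taking a gradient step on a cached hidden state while leaving the model weights frozen, can be illustrated with a toy sketch. This is not the authors' implementation: the single-layer softmax "model", the weight matrix `W`, the hidden vector `h`, the function names, and the learning rate are all assumptions made for illustration. A real transformer would backpropagate through attention over its cache instead of using the closed-form gradient below.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_from_hidden(W, h):
    # Toy "model": logits are a linear function of the cached hidden state.
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]

def log_prob(W, h, target):
    # Log-probability the toy model assigns to the observed token.
    return math.log(softmax(logits_from_hidden(W, h))[target])

def grad_wrt_hidden(W, h, target):
    # d log p(target) / d h = W^T (onehot(target) - softmax(W h)).
    p = softmax(logits_from_hidden(W, h))
    err = [(1.0 if i == target else 0.0) - p_i for i, p_i in enumerate(p)]
    return [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(h))]

def hidden_state_step(W, h, target, lr=0.1):
    # Gradient ascent on the hidden state; the weights W are never modified.
    g = grad_wrt_hidden(W, h, target)
    return [h_j + lr * g_j for h_j, g_j in zip(h, g)]

# Hypothetical frozen output weights (3-token vocabulary, 2-dim hidden state).
W = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]
h = [0.2, -0.1]   # hypothetical cached hidden state
target = 1        # index of the token observed in the evaluation text

before = log_prob(W, h, target)
h_new = hidden_state_step(W, h, target)
after = log_prob(W, h_new, target)
print(f"log p before: {before:.4f}, after: {after:.4f}")
```

The sketch re-scores the same token only to show that the step moves the hidden state in a direction that raises the log-probability. In a setting analogous to dynamic evaluation, the update would presumably be computed from text that has already been scored, so that the adjusted cache benefits predictions on later text.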
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Cosine Annealing · Variational Dropout · Adaptive Softmax
