Reconsidering the Past: Optimizing Hidden States in Language Models

Davis Yoshida; Kevin Gimpel

arXiv:2112.08653 · cs.CL · December 17, 2021

Open Access

TL;DR

This paper introduces Hidden-State Optimization (HSO), a gradient-based method that improves transformer language models at inference time by updating their cached hidden states rather than their parameters. It yields better perplexity and better prompt-based few-shot results with no extra parameters or training data.

Contribution

The paper proposes HSO, an inference-time optimization technique that updates the cached hidden states of transformer language models. Tested with pretrained Transformer-XL and GPT-2, it improves perplexity on WikiText103 and PG-19 and shows gains in prompt-based few-shot evaluation.

Findings

01. HSO improves perplexity on the WikiText103 and PG-19 datasets.

02. HSO improves results in prompt-based few-shot evaluation.

03. The perplexity gains are most pronounced when a model is evaluated outside its training distribution.

Abstract

We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models at inference time. Similar to dynamic evaluation (Krause et al., 2018), HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters. We test HSO with pretrained Transformer-XL and GPT-2 language models, finding improvement on the WikiText103 and PG-19 datasets in terms of perplexity, especially when evaluating a model outside of its training distribution. We also demonstrate downstream applicability by showing gains in the recently developed prompt-based few-shot evaluation setting, again with no extra parameters or training data.
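
The core idea in the abstract, taking a gradient step on cached hidden states while leaving the parameters fixed, can be illustrated with a toy sketch. This is not the authors' implementation: a single linear-softmax output layer stands in for the transformer, and the names `W`, `cached_h`, `hso_step`, and the step size `lr` are all hypothetical choices made for illustration. A real setup would backpropagate through the full model to its cached states.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_from(W, h):
    """Toy 'model': logits are a fixed linear map of the hidden state."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def log_prob(W, h, target):
    """Log-probability the toy model assigns to the observed token."""
    return math.log(softmax(logits_from(W, h))[target])

def grad_wrt_hidden(W, h, target):
    """Gradient of log p(target) with respect to h: W^T (onehot - probs)."""
    probs = softmax(logits_from(W, h))
    err = [(1.0 if i == target else 0.0) - p for i, p in enumerate(probs)]
    return [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(h))]

def hso_step(W, h, target, lr=0.5):
    """One gradient-ascent step on the cached state; W is never modified."""
    g = grad_wrt_hidden(W, h, target)
    return [x + lr * gx for x, gx in zip(h, g)]

# Hypothetical fixed parameters (3-token vocabulary, 2-dim hidden state).
W = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]
cached_h = [0.2, -0.4]   # stands in for a cached hidden state
observed_token = 1       # token actually seen in the evaluation text

before = log_prob(W, cached_h, observed_token)
updated_h = hso_step(W, cached_h, observed_token)
after = log_prob(W, updated_h, observed_token)
print(f"log-prob before: {before:.4f}, after: {after:.4f}")
```

The contrast with dynamic evaluation is in what gets updated: in this sketch only `cached_h` changes, whereas a dynamic-evaluation step would apply the gradient to `W`.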

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Cosine Annealing · Variational Dropout · Adaptive Softmax