Estimating the Probabilities of Rare Outputs in Language Models

Gabriel Wu; Jacob Hilton

arXiv:2410.13211·cs.LG·February 7, 2025

TL;DR

This paper explores methods for estimating extremely low probabilities of rare outputs in language models, comparing importance sampling and activation extrapolation, with importance sampling showing superior performance.

Contribution

It introduces and evaluates two approaches for low probability estimation in language models, highlighting importance sampling as more effective than activation extrapolation.
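To make the contrast with naive sampling concrete, here is a minimal sketch of importance sampling on a toy problem. It is not the paper's method: the "model" is replaced by a hypothetical rule (`is_rare`), the input distribution is assumed to be uniform over independent tokens, and the proposal distribution is chosen by hand rather than found by searching for inputs. All names and constants are illustrative assumptions.

```python
import math
import random

VOCAB_SIZE = 10   # hypothetical vocabulary size
SEQ_LEN = 12      # hypothetical input length
THRESHOLD = 10    # rare event: at least this many positions hold token 0


def is_rare(tokens):
    """Toy stand-in for 'the model produces the rare output' on this input."""
    return sum(1 for t in tokens if t == 0) >= THRESHOLD


def exact_probability():
    """Ground truth under the uniform input distribution (a binomial tail)."""
    p0 = 1.0 / VOCAB_SIZE
    return sum(
        math.comb(SEQ_LEN, k) * p0**k * (1 - p0) ** (SEQ_LEN - k)
        for k in range(THRESHOLD, SEQ_LEN + 1)
    )


def naive_estimate(num_samples, seed=0):
    """Plain Monte Carlo: sample inputs uniformly and count hits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        tokens = [rng.randrange(VOCAB_SIZE) for _pos in range(SEQ_LEN)]
        hits += is_rare(tokens)
    return hits / num_samples


def importance_sampling_estimate(num_samples, q0=0.8, seed=0):
    """Sample from a proposal that favours token 0, then reweight by p(x)/q(x)."""
    rng = random.Random(seed)
    p_tok = 1.0 / VOCAB_SIZE                  # target probability of any token
    q_other = (1.0 - q0) / (VOCAB_SIZE - 1)   # proposal probability of each nonzero token
    total = 0.0
    for _ in range(num_samples):
        tokens = []
        weight = 1.0
        for _pos in range(SEQ_LEN):
            if rng.random() < q0:
                tokens.append(0)
                weight *= p_tok / q0
            else:
                tokens.append(rng.randrange(1, VOCAB_SIZE))
                weight *= p_tok / q_other
        if is_rare(tokens):
            total += weight                   # indicator times importance weight
    return total / num_samples


if __name__ == "__main__":
    print("exact     :", exact_probability())
    print("naive     :", naive_estimate(20_000))
    print("importance:", importance_sampling_estimate(20_000))
```

In this toy setting the event is so rare that uniform sampling almost never sees it, while the reweighted proposal sees it in a large fraction of samples. In practice the hard part is finding a good proposal, which this sketch sidesteps by choosing one by hand.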

Findings

01. Importance sampling outperforms activation extrapolation.
02. Both methods outperform naive sampling.
03. Low probability estimation aids in improving worst-case guarantees.

Abstract

We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model's output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model's logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the…
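The idea of extrapolating from a distribution fit to the logits can be illustrated with a deliberately simplified sketch. The assumptions here are not from the paper: the logit gap (the rare token's logit minus the best competing logit) is synthetic, and a single Gaussian is used as the fitted distribution. Under argmax sampling the rare token is emitted exactly when this gap is positive, so the fitted tail mass above zero serves as the estimate. Function names are hypothetical.

```python
import math
import random
import statistics


def sample_logit_gaps(num_samples, seed=0):
    """Synthetic stand-in for (rare-token logit minus best competing logit)."""
    rng = random.Random(seed)
    return [rng.gauss(-6.0, 1.0) for _ in range(num_samples)]


def gaussian_tail_estimate(gaps):
    """Fit a Gaussian to the observed gaps and return the fitted P(gap > 0)."""
    mu = statistics.fmean(gaps)
    sigma = statistics.stdev(gaps)
    z = (0.0 - mu) / sigma
    # Upper tail of a standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def true_tail():
    """Exact P(gap > 0) for the synthetic N(-6, 1) gap used above."""
    return 0.5 * math.erfc(6.0 / math.sqrt(2.0))


if __name__ == "__main__":
    gaps = sample_logit_gaps(5_000)
    print("samples with gap > 0:", sum(g > 0 for g in gaps))
    print("extrapolated        :", gaussian_tail_estimate(gaps))
    print("true (synthetic)    :", true_tail())
```

No explicit example of the rare event is needed here, but the estimate is only as good as the distributional assumption. Small errors in the fitted mean or variance are amplified deep in the tail.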

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 10 · Confidence 4

Strengths

- The paper has a novel framing of low probability estimation as a concrete approach to understanding and improving worst-case model performance. It is the first paper to explore extrapolation-based methods that don't require finding explicit examples of rare behaviors.
- The best performing method presented in the paper, MHIS, is motivated using existing work in automated redteaming. By adapting the Greedy Coordinate Gradient technique (previously used for finding jailbreaks) into a proposal

Weaknesses

- The models tested are small, and it's unclear whether these methods can be scaled up to actual models of interest. However, the decision to use smaller models is justified, as it would be otherwise impossible to compute the true probability on all 2^32 input samples, which is necessary for measuring the performance of the methods.
- Given that adversarial training is stated as a part of the motivation for developing low probability estimation methods, there is a gap in the discussion around

Reviewer 02 · Rating 6 · Confidence 2

Strengths

I am not familiar with prior work in this domain, but it appears as if the paper introduces novel methods for trying to estimate the probabilities of certain outputs given an input distribution. Many intuitions are described in prose and are very clear/helpful:
- Why MHIS may be needed, why this is better for higher capacity models
- Why activation extrapolation methods could be better in certain scenarios.
- Why formally defined distribution can be relevant.

The experiments are sound and the

Weaknesses

To my understanding, the main motivation described for this problem stems from using these methods to reduce the probability of rare but undesirable outputs. Although a reasonable motivation, there are limited results in this direction because, as the authors point out, doing this for importance sampling reduces to adversarial training, and the activation extrapolation methods are currently worse than the importance sampling methods.
- A more thorough description of how you would do this for a

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The paper is well written.
- The paper provides simple methods for low probability estimation in LLMs.
- Empirical results validate the working of their proposed methods (the importance sampling works better as shown by the authors).

Weaknesses

- One major drawback I find is the lack of a clear goal for the paper. The paper seems to provide some interesting algorithms for low probability estimation and a high-level discussion of how they can be used in practice. But there is no clear application or experiment that shows the usefulness of the contribution. I would urge the authors to kindly provide some clear applications and experiments to show its usefulness.

Code & Models

Repositories

alignment-research-center/low-probability-estimation
jax · Official

Videos

Estimating the Probabilities of Rare Outputs in Language Models · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling