Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs

Minh Nhat Nguyen; Andrew Baker; Clement Neo; Allen Roush; Andreas Kirsch; Ravid Shwartz-Ziv

arXiv:2407.01082·cs.CL·November 21, 2025·1 cites

TL;DR

This paper introduces min-p sampling, a dynamic token sampling method for LLMs that enhances the quality and diversity of generated text, especially at higher temperatures, by adjusting thresholds based on model confidence.

Contribution

We propose min-p sampling, a novel dynamic truncation technique that improves text coherence and creativity in LLM outputs across various models and sizes.

Findings

1. Min-p sampling outperforms top-p sampling in quality and diversity.
2. Human evaluations favor min-p sampling for creativity and coherence.
3. The method is widely adopted in open-source LLM frameworks.

Abstract

Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, which lead to incoherent or repetitive outputs. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures. Human evaluations further show a clear preference for min-p sampling, in both text…
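
The truncation rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the names `min_p_filter`, `sample_min_p`, and `p_base`, the default value of 0.1, and the choice to apply temperature before truncation are all assumptions made here for the example. A token is kept only if its probability is at least `p_base` times the top token's probability, so the cutoff rises when the model is confident and falls when the distribution is flat.

```python
import numpy as np


def min_p_filter(probs, p_base=0.1):
    """Keep tokens with prob >= p_base * max(probs), then renormalize.

    `p_base` is a hypothetical name for the base threshold; the scaled
    cutoff follows the abstract's description (top-token probability
    used as a scaling factor).
    """
    probs = np.asarray(probs, dtype=np.float64)
    threshold = p_base * probs.max()           # cutoff scales with confidence
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()                   # top token always survives


def sample_min_p(logits, p_base=0.1, temperature=1.0, rng=None):
    """Sample one token id; temperature-before-truncation is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                     # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    filtered = min_p_filter(probs, p_base)
    return int(rng.choice(len(filtered), p=filtered))


# Confident distribution: cutoff is 0.1 * 0.90 = 0.09, only the top token stays.
confident = min_p_filter([0.90, 0.05, 0.03, 0.02], p_base=0.1)

# Flat distribution: cutoff is 0.1 * 0.30 = 0.03, every token stays.
flat = min_p_filter([0.30, 0.25, 0.25, 0.20], p_base=0.1)
```

With the same `p_base`, the confident distribution collapses to a single candidate while the flat one keeps all four, which is the dynamic behavior a fixed top-p cutoff does not provide.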

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- New sampling method: This paper proposes the Min-p Sampling method for better control over the diversity of generated outputs compared to fixed threshold methods like top-p.
- Conducted experiments: The authors conducted experiments across tasks, ablation studies, and human evaluation.
- High reproducibility: The author released the implementation, code, and repo with implementation guidelines, which enhances the reproducibility.
- Wide Applicability: The proposed method can be easily integ

Weaknesses

- The experiment is limited to Mistral models and fails to demonstrate applicability with other models. It would be more comprehensive and interesting to see results from additional models, such as LLaMA3.
- The effectiveness of min-p sampling highly depends on the base probability thresholds. As shown in Table 6 (ablation study results), the choice of thresholds significantly impacts LLM performance. This indicates that optimal performance requires careful tuning, which could limit the method’

Reviewer 02 · Rating 10 · Confidence 4

Strengths

* Sampling is one of those areas where the model per se needs to be complemented with an outside algorithm, allowing for creativity in how to set this up. This work proposes an original twist on a popular choice.
* The proposal is simple and appealing.
* It has good empirical results, both as measured on benchmarks and (more important) by adoption by the community.

Weaknesses

The new 10-page limit has not been handled wisely in my opinion, and the paper could do more with less text. In particular, Sect. 4 could be removed without much loss to the overall paper. Having experiments on a 123B model has to be commended. The paper would be stronger, however, if the authors could show that the results hold on different model families (e.g., Llama and Mistral), as otherwise it is not clear whether this method provides gains on one family only.

Reviewer 03 · Rating 10 · Confidence 4

Strengths

This paper presents compelling evidence that its single contribution, min-p sampling, is highly effective. The usage of it in 54,000 GitHub repositories alone is very impressive. In addition to that, they produced theoretical reasoning why their method works, LLM-generated statistics with explanations about how to interpret these statistics, additional statistics which involved human participants, examples of seeing how the logits are transformed under different distributions which give additi

Weaknesses

The one contribution of this paper, min-p sampling, is extremely simple and not mathematically "deep" at all - no theorems were presented, and the code implementation literally (was provided and) took less than one page. However, I think that having such a paper in a conference proceeding is not a bad thing.

Videos

Greedy? Min-p? Beam Search? How LLMs Actually Pick Words – Decoding Strategies Explained · YouTube

Taxonomy

Topics: Creativity in Education and Neuroscience

Methods: LLaMA · Balanced Selection