On the Effect of Sampling Diversity in Scaling LLM Inference

Tianchun Wang; Zichuan Liu; Yuanzhou Chen; Jonathan Light; Weiyang Liu; Haifeng Chen; Xiang Zhang; Wei Cheng

arXiv:2502.11027·cs.LG·December 22, 2025

Open Access · 3 Reviews

TL;DR

This paper systematically analyzes how prompt sampling diversity influences large language model inference, providing theoretical insights and empirical results that demonstrate significant performance gains when diversity is effectively utilized.

Contribution

It offers a theoretical explanation for the benefits of diversified sampling in LLM inference and introduces a diversity-fidelity trade-off principle for designing sampling strategies.

Findings

01

Diversified sampling reduces error rates in Best-of-N inference.

02

Effective diversity strategies can improve reasoning, mathematics, and code generation performance.

03

Diversity may diminish under majority voting, affecting inference outcomes.

Abstract

Scaling large language model (LLM) inference is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful response diversity, we systematically study the effect of prompt diversity in scaling inference. We theoretically explain why diversified sampling improves Best-of-N scaling, showing that responses generated from diverse prompts exhibit significantly lower error rates after Best-of-N selection than those produced from stationary prompts. Building on this analysis, we derive a diversity-fidelity trade-off principle that guides the design of sampling strategies that introduce diversity. From this guidance, we instantiate a family of effective perturbation styles. We theoretically and empirically characterize when diversified exploration remains effective,…
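The Best-of-N claim in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's theorem or method: it assumes independent samples, a perfect verifier that picks a correct response whenever one exists, and made-up per-sample success probabilities. The function name `best_of_n_failure` and all numbers are hypothetical. Under these assumptions, Best-of-N fails only when every sample fails, so the failure probability is the product of the per-sample failure probabilities.

```python
import math


def best_of_n_failure(success_probs):
    """Probability that all samples fail.

    Assumes independent samples and a perfect verifier (illustrative
    assumptions, not taken from the paper).
    """
    return math.prod(1.0 - p for p in success_probs)


# Hypothetical per-sample success probabilities, chosen for illustration only.
stationary = [0.3, 0.3, 0.3, 0.3]      # one fixed prompt sampled 4 times
diversified = [0.1, 0.2, 0.4, 0.5]     # 4 perturbed prompts, same mean (0.3)
low_fidelity = [0.05, 0.1, 0.3, 0.4]   # perturbations that lower the mean

fail_stationary = best_of_n_failure(stationary)
fail_diversified = best_of_n_failure(diversified)
fail_low_fidelity = best_of_n_failure(low_fidelity)

print(f"stationary:   {fail_stationary:.4f}")
print(f"diversified:  {fail_diversified:.4f}")
print(f"low fidelity: {fail_low_fidelity:.4f}")
```

With these numbers the stationary failure probability is about 0.240 and the diversified one about 0.216. When the mean success probability is held fixed, the AM-GM inequality guarantees that spreading the probabilities cannot increase the product of failure terms. The third case, about 0.359, shows the other side of the trade-off in this toy setting: perturbations that cost too much per-sample accuracy can erase the gain.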

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- It gives a theoretically sound justification of why exploration diversity improves best-of-N.
- The evaluation is comprehensive, covering both task-level and query-level perturbations.

Weaknesses

- The presentation could be improved by using notation and symbols more sparingly. Also, “RNG” is never defined.
- Line 144 is misleading. Unless the success probability of the LLM solving one particular problem is defined precisely to be conditioned on the diversity source, it (i.e., the success probability) should be static. I can understand the authors’ intention. Maybe the authors can define it as a sampling-based success probability where different sampling diversities

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. Introduces a novel and systematic study of sampling diversity in LLM inference, combining theoretical proof, perturbation design, and empirical validation.
2. Demonstrates that exploration diversity can significantly enhance LLM performance without retraining, offering a practical, general-purpose strategy for improving test-time scaling across domains.

Weaknesses

1. The writing and presentation are weak and difficult to follow. For example, in Section 3 (line 144), the term RNG appears without any prior introduction or explanation. Additionally, the two hypotheses are presented without sufficient depth or illustrative support; although Remark 3.2 attempts clarification, it remains unclear. Including concrete examples or clearer illustrations would greatly improve readability and understanding.
2. The central insight of this paper is that incorporating p

Reviewer 03 · Rating 4 · Confidence 4

Strengths

**Theoretical explanation.** The paper presents a formal theoretical result showing that diversified sampling reduces the failure probability in Best-of-N inference. This offers a conceptual foundation rather than only relying on empirical results.

**Experimental results.** The authors conduct extensive experiments across diverse tasks and settings, providing strong empirical evidence to support their ideas.

**Practical implications.** The work offers practical insights into designing effectiv

Weaknesses

**1. Missing cost consideration in experiments.** While the experiments demonstrate that diversified sampling improves LLM performance, the paper does not discuss the associated computational or time costs. An analysis of the trade-off between performance gains and inference cost would strengthen the empirical evaluation and clarify the practical value of the approach.

**2. Connections and readability.** The paper lacks clear connections between sections—specifically between notation, theoretic

Taxonomy

Topics: Natural Language Processing Techniques · Neural Networks and Applications · Topic Modeling