TL;DR
This paper introduces String Seed of Thought (SSoT), a prompting technique for large language models that enhances their ability to produce distribution-faithful and diverse responses, addressing biases and diversity issues in probabilistic instruction following tasks.
Contribution
The paper proposes SSoT, a simple prompting method that improves LLMs' probabilistic instruction following and response diversity by guiding the model to generate and manipulate an initial random string.
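As a rough illustration of what "generating and manipulating a random string" could look like, the sketch below maps a seed string to one of several weighted options. The mapping rule (sum of character codes modulo a fixed resolution, then a cumulative-probability lookup) and the names `pick_from_seed` and `resolution` are assumptions made for this example. This excerpt does not say which manipulation the model actually performs under SSoT.

```python
import math
from typing import Sequence


def pick_from_seed(
    seed: str,
    options: Sequence[str],
    probs: Sequence[float],
    resolution: int = 100,
) -> str:
    """Map a seed string to one option according to target probabilities.

    Hypothetical mapping for illustration only, not the paper's procedure.
    """
    if not options or len(options) != len(probs):
        raise ValueError("options and probs must be non-empty and equal length")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must sum to 1")

    # Reduce the string to a number in [0, 1) using simple arithmetic.
    u = (sum(ord(ch) for ch in seed) % resolution) / resolution

    # Walk the cumulative distribution until it exceeds u.
    cumulative = 0.0
    for option, p in zip(options, probs):
        cumulative += p
        if u < cumulative:
            return option
    return options[-1]  # guard against floating-point rounding
```

For example, `pick_from_seed("ab", ["heads", "tails"], [0.3, 0.7])` reduces the seed to 0.95 and therefore selects `"tails"`. The choice is only as random as the seed string itself, which is why the quality of the generated string matters.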
Findings
SSoT significantly improves the Probabilistic Instruction Following (PIF) performance of LLMs.
SSoT enhances response diversity in open-ended tasks.
SSoT approaches the performance of an ideal pseudo-random number generator.
Abstract
We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases that are problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. This failure also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM…
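One way to make "the empirical distribution aligns with the target distribution" concrete is to measure the gap between the two. The sketch below uses total variation (TV) distance, which the reviews mention in connection with the paper's theoretical bound. Whether the paper's experiments use this exact metric is not stated in this excerpt, and the function name `total_variation` is an assumption for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def total_variation(samples: Sequence[str], target: Dict[str, float]) -> float:
    """Total variation distance between sampled answers and a target distribution.

    Returns a value in [0, 1]; 0 means the empirical frequencies match exactly.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    counts = Counter(samples)
    n = len(samples)

    # Include answers outside the target set: they count as pure error.
    support = set(target) | set(counts)
    return 0.5 * sum(
        abs(counts[option] / n - target.get(option, 0.0)) for option in support
    )
```

As a usage example, if a model asked for a fair coin flip answers "heads" 7 times out of 10, `total_variation(["heads"] * 7 + ["tails"] * 3, {"heads": 0.5, "tails": 0.5})` evaluates to approximately 0.2.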
Peer Reviews
Decision · ICLR 2026 Poster
1. The SSoT prompting strategy is conceptually simple to implement (just adding a brief instruction to generate and use a random string) and is applicable to a wide range of LLMs without any model modifications.
2. SSoT dramatically improves an LLM’s ability to follow probabilistic instructions, achieving empirical sampling frequencies very close to the target probabilities. It also boosts the diversity of open-ended generations, outperforming other diversity-promoting baselines like prompt par
1. This method may cause potential errors and hurt answer quality. While the method focuses on matching distributions and diversity, the paper provides little discussion on whether the use of SSoT could inadvertently affect the correctness or factuality of outputs in tasks where a specific correct answer is expected.
2. Most experiments involve tasks with a small, discrete set of outcomes (binary or a few categories), so it remains uncertain how well SSoT would scale to more complex distributio
1. This method is simple yet effective. It works well across different LLMs and task types without tuning, demonstrating strong engineering practicality.
2. This method is theoretically sound. It provides a rigorous bound showing that the TV distance between the empirical and target distribution decreases as the generated string length increases, even when the string exhibits autocorrelation. This result gives the method solid mathematical grounding. Moreover, the authors derived the bound for
1. It would be better if the authors could provide some failure case analysis. In Table 1, the performance of QwQ-32B on the 2-choice task is even worse than the baseline. Is this degradation due to autocorrelation in the generated random strings, inappropriate mapping, or possible execution/calculation errors?
2. The models used in the experiments are all quite large. Considering that SSoT relies on the model itself to decide the mapping strategy and execute it, this raises doubts that the perf
1. Two theorems provide the lower bound of the total variation distance between the sample and the required distributions.
2. The experiment section is well presented overall. Figures and tables convinced me that the proposed method indeed improves the LLM's probabilistic instruction-following capabilities.
3. In addition to 2, the rich details in the appendix and the uploaded supporting material benefit the reproduction of numerical studies.
1. The assumption of the proved theorems that "each character is randomly drawn from a distribution" is hard to meet for LLMs. They are pre-trained on text corpora, and every token they generate is conditioned on the preceding content (causal language modeling), thus never random. I could be proved wrong, at least empirically, by an analysis of the distribution/duplication of LLM-generated random strings.
2. Personally, I am not fully motivated to tune an LLM to perform PIF tasks, which could
