Semantic Probabilistic Control of Language Models
Kareem Ahmed, Catarina G Belem, Padhraic Smyth, Sameer Singh

TL;DR
This paper introduces a novel semantic control method for language models that uses verifier gradient information to steer generations towards desired attributes like toxicity, sentiment, or politeness, achieving high accuracy without quality loss.
Contribution
It presents a new gradient-based approach for efficient semantic control of language models, overcoming limitations of previous sampling methods and enabling precise attribute steering.
Findings
Achieves >95% control accuracy for toxicity, sentiment, and topic adherence.
Maintains language model quality while enforcing semantic constraints.
Provides a computationally efficient alternative to sampling-based control methods.
Abstract
Semantic control entails steering LM generations towards satisfying subtle non-lexical constraints, e.g., toxicity, sentiment, or politeness, attributes that can be captured by a sequence-level verifier. It can thus be viewed as sampling from the LM distribution conditioned on the target attribute, a computationally intractable problem due to the non-decomposable nature of the verifier. Existing approaches to LM control either only deal with syntactic constraints which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events. In this work, we leverage a verifier's gradient information to efficiently reason over all generations that satisfy the target attribute, enabling precise steering of LM generations by reweighing the next-token distribution. Starting from an initial sample, we…
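The idea of reweighing the next-token distribution can be illustrated with a generic Bayes-rule sketch. This is not the paper's algorithm: the function name `reweight_next_token`, the token set, and all probabilities are hypothetical, and the per-token verifier estimates are simply taken as given inputs, whereas the paper derives its estimates from the verifier's gradient information.

```python
def reweight_next_token(lm_probs, verifier_probs):
    """Bayes-rule reweighting of a next-token distribution.

    lm_probs:       p(token | prefix) from the language model.
    verifier_probs: assumed estimate of p(attribute | prefix + token).
    Returns p(token | prefix, attribute), which is proportional to the
    product of the two inputs.
    """
    unnormalized = {tok: lm_probs[tok] * verifier_probs[tok] for tok in lm_probs}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("verifier assigns zero probability to every candidate")
    return {tok: weight / total for tok, weight in unnormalized.items()}


# Toy, made-up numbers for a "positive sentiment" target attribute.
lm_probs = {"great": 0.5, "awful": 0.3, "okay": 0.2}
verifier_probs = {"great": 0.9, "awful": 0.1, "okay": 0.5}

steered = reweight_next_token(lm_probs, verifier_probs)
for token, prob in steered.items():
    print(f"{token}: {prob:.3f}")
```

In this toy setting, tokens the verifier favors gain probability mass and disfavored tokens lose it, while the result remains a valid distribution. The hard part, which the sketch leaves out, is obtaining the verifier estimates without enumerating every possible continuation.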
Peer Reviews
Decision·Submitted to ICLR 2026
**Empirical Performance**: The empirical performance does appear quite strong, and the paper demonstrates that this does scale to larger models. **Experiment Design**: The experimental design also seems very strong: the selected prompts for each task (toxicity, sentiment, topic) seem appropriate given the constraint; and the evaluation metrics are reasonable.
**Framing and Motivation**: The paper is framed as performing exact inference of an approximate distribution, but I feel that this is not quite accurate. Finite-time Gibbs sampling is an MCMC method, which converges only as the number of sampling steps goes to infinity. Pseudo-likelihoods from a masked language model do not necessarily correspond to a valid joint distribution [1]. Thus the Gibbs sampling algorithm may not even be sampling from a well-defined distribution. The Taylor approximation also introduces a
1. ScoNE is a lightweight method that demonstrably outperforms existing baselines on toxicity, sentiment, and topic control. 2. The method is timely and relevant, as it proposes a way to control generation by LLMs using only their output probabilities.
A primary weakness of the paper is insufficient contextualization with respect to prior work. A missing paper is FUDGE (https://arxiv.org/pdf/2104.05218), which also operates by reweighing token probabilities with Bayes' rule, via an attribute classifier on future generations. Since FUDGE is very close to your method, it should be discussed up front and tested as a baseline. A second crucial weakness of the paper is the mathematical exposition. The paper was difficult to read, taking many close readings in o
Strengths: The experimental design is relatively comprehensive. The authors validate the proposed method across three distinct semantic control tasks (toxicity, sentiment, topic), covering both "constraint satisfaction" (e.g., detoxification, positive sentiment) and "constraint enhancement" (e.g., toxification) scenarios. For each task, multiple base models (Llama-3.2, GPT2-medium) and diverse baselines (training-free methods like random, beam search, BoN; training-based methods like PPLM, DExpert
Critical Weaknesses and Reasons for Rejection. 1. Severe Gaps in Related Work Comparison, Leading to Incomplete Contribution Positioning: The core idea of this work, Bayesian-based probabilistic inference for LM semantic control, falls into a research paradigm that was extensively explored around 3 years ago, yet the authors fail to compare with key representative works in this field, resulting in an unclear positioning of the work’s novelty. Specifically: Omission of classic Bayesian control methods
Taxonomy
Topics: Semantic Web and Ontologies
