Semantic Probabilistic Control of Language Models

Kareem Ahmed; Catarina G Belem; Padhraic Smyth; Sameer Singh

arXiv:2505.01954·cs.LG·May 6, 2025

Open Access·3 Reviews

TL;DR

This paper introduces a novel semantic control method for language models that uses verifier gradient information to steer generations towards desired attributes like toxicity, sentiment, or politeness, achieving high accuracy without quality loss.

Contribution

It presents a new gradient-based approach for efficient semantic control of language models, overcoming limitations of previous sampling methods and enabling precise attribute steering.

Findings

01. Achieves >95% control accuracy for toxicity, sentiment, and topic adherence.

02. Maintains language model quality while enforcing semantic constraints.

03. Provides a computationally efficient alternative to sampling-based control methods.

Abstract

Semantic control entails steering LM generations towards satisfying subtle non-lexical constraints, e.g., toxicity, sentiment, or politeness, attributes that can be captured by a sequence-level verifier. It can thus be viewed as sampling from the LM distribution conditioned on the target attribute, a computationally intractable problem due to the non-decomposable nature of the verifier. Existing approaches to LM control either only deal with syntactic constraints which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events. In this work, we leverage a verifier's gradient information to efficiently reason over all generations that satisfy the target attribute, enabling precise steering of LM generations by reweighing the next-token distribution. Starting from an initial sample, we…
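The reweighting idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm (the abstract is truncated here); it only shows, under stated assumptions, how a verifier's gradient taken at an initial sample could give a first-order estimate of the verifier for every candidate token, and how those estimates could reweight a next-token distribution via Bayes' rule. The vocabulary, embeddings, verifier weights, and all function names below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy setup: four tokens, each with a 2-d embedding.
VOCAB = ["great", "fine", "bad", "awful"]
EMBED = [[1.0, 0.5], [0.3, 0.1], [-0.6, 0.2], [-1.0, -0.4]]
W = [2.0, 0.5]  # weights of an assumed linear-logistic verifier

def verifier(e):
    """Toy verifier: probability that embedding e has the target attribute."""
    return sigmoid(sum(w * x for w, x in zip(W, e)))

def verifier_grad(e):
    """Analytic gradient of the toy verifier with respect to the embedding."""
    p = verifier(e)
    return [p * (1.0 - p) * w for w in W]

def taylor_scores(current_idx):
    """First-order estimate of the verifier for every candidate token,
    expanded around the embedding of the current sample's token."""
    e0 = EMBED[current_idx]
    f0, g = verifier(e0), verifier_grad(e0)
    scores = []
    for e in EMBED:
        est = f0 + sum(gi * (ei - e0i) for gi, ei, e0i in zip(g, e, e0))
        scores.append(min(max(est, 1e-6), 1.0))  # clip to a valid probability
    return scores

def reweight(lm_logits, attr_scores):
    """Bayes-rule reweighting: p(x | a) is proportional to p(x) * p(a | x)."""
    prior = softmax(lm_logits)
    unnorm = [p * s for p, s in zip(prior, attr_scores)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

lm_logits = [0.0, 0.5, 1.0, 0.2]  # assumed next-token logits from the LM
prior = softmax(lm_logits)
posterior = reweight(lm_logits, taylor_scores(current_idx=1))

for tok, p, q in zip(VOCAB, prior, posterior):
    print(f"{tok:>6}: prior={p:.3f} reweighted={q:.3f}")
```

In this toy setting, a single gradient evaluation at the initial sample scores the whole vocabulary, so probability mass shifts toward tokens the verifier favors without running the verifier once per candidate. The estimate is only as good as the linearization, which is why the sketch clips it to a valid probability.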

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 3

Strengths

**Empirical Performance**: The empirical performance does appear quite strong, and the paper demonstrates that this does scale to larger models. **Experiment Design**: The experimental design also seems very strong: the selected prompts for each task (toxicity, sentiment, topic) seem appropriate given the constraint; and the evaluation metrics are reasonable.

Weaknesses

**Framing and Motivation**: The paper is framed as performing exact inference of an approximate distribution, but I feel that this is not quite accurate. Finite-time Gibbs sampling is an MCMC method, which converges only as the number of sampling steps goes to infinity. Pseudo-likelihoods from a masked language model do not necessarily correspond to a valid joint distribution [1]. Thus the Gibbs sampling algorithm may not even be sampling from a well-defined distribution. The Taylor approximation also introduces a

Reviewer 02·Rating 4·Confidence 4

Strengths

1. ScoNE is a lightweight method that demonstrably outperforms existing baselines on toxicity, sentiment, and topic control. 2. The method is timely and relevant, as it proposes a way to control generation by LLMs using only their output probabilities.

Weaknesses

A primary weakness of the paper is insufficient contextualization with respect to prior work. A missing paper is FUDGE (https://arxiv.org/pdf/2104.05218), which also operates by reweighing token probabilities with Bayes' rule, via an attribute classifier on future generations. Since FUDGE is very close to your method, it should be discussed up front and tested as a baseline. A second crucial weakness of the paper is the mathematical exposition. The paper was difficult to read, taking many close readings in o

Reviewer 03·Rating 2·Confidence 5

Strengths

The experimental design is relatively comprehensive. The authors validate the proposed method across three distinct semantic control tasks (toxicity, sentiment, topic), covering both "constraint satisfaction" (e.g., detoxification, positive sentiment) and "constraint enhancement" (e.g., toxification) scenarios. For each task, multiple base models (Llama-3.2, GPT2-medium) and diverse baselines (training-free methods like random, beam search, BoN; training-based methods like PPLM, DExpert

Weaknesses

Critical Weaknesses and Reasons for Rejection

1. Severe Gaps in Related Work Comparison, Leading to Incomplete Contribution Positioning

The core idea of this work, Bayesian-based probabilistic inference for LM semantic control, falls into a research paradigm that was extensively explored around 3 years ago, yet the authors fail to compare with key representative works in this field, resulting in an unclear positioning of the work's novelty. Specifically: Omission of classic Bayesian control methods

Taxonomy

Topics: Semantic Web and Ontologies