Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

Sekitoshi Kanai; Tsukasa Yoshida; Hiroshi Takahashi; Haru Kuroki; Kazumune Hashimoto

arXiv:2510.26219·cs.LG·February 13, 2026

TL;DR

This paper introduces AISP, a novel test-time alignment method for LLMs that uses sampling-based optimal control in pre-logit space, improving reward maximization without costly fine-tuning.

Contribution

The paper presents AISP, a new test-time alignment technique that applies importance sampling and stochastic control in pre-logit space and outperforms existing reward-based test-time alignment methods.

Findings

01

AISP outperforms best-of-n sampling in reward achieved for a given number of samples.

02

AISP achieves higher rewards than other test-time alignment methods.

03

The method effectively aligns LLMs without fine-tuning.

Abstract

Test-time alignment of large language models (LLMs) is attracting attention because fine-tuning LLMs is computationally expensive. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with a stochastic control input. AISP applies a Gaussian perturbation to the pre-logits, which are the outputs of the penultimate layer, and maximizes the expected reward with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of reward for a given number of samples and achieves higher rewards than other reward-based test-time alignment methods.
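The loop described in the abstract (sample Gaussian perturbations, score them, re-estimate the mean by reward-weighted importance sampling) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the reward is a hypothetical quadratic `toy_reward` standing in for an LLM plus reward model, the perturbed vector stands in for the pre-logits, and the names `softmax_weights` and `adaptive_importance_sampling`, the softmax form of the weights, and all default values are assumptions made for illustration.

```python
import numpy as np


def softmax_weights(rewards, lam):
    """Normalized importance weights proportional to exp(reward / lam)."""
    r = np.asarray(rewards, dtype=float)
    w = np.exp((r - r.max()) / lam)  # subtract the max for numerical stability
    return w / w.sum()


def adaptive_importance_sampling(reward_fn, dim, n_samples=32, n_iters=10,
                                 sigma=0.5, lam=0.1, seed=0):
    """Iteratively re-estimate the mean of a Gaussian perturbation.

    Assumed update rule: the new mean is the reward-weighted average of
    the samples drawn around the current mean.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    for _ in range(n_iters):
        # Draw candidates around the current mean with fixed std sigma.
        samples = mu + sigma * rng.standard_normal((n_samples, dim))
        rewards = np.array([reward_fn(s) for s in samples])
        # Higher-reward samples receive larger weights.
        weights = softmax_weights(rewards, lam)
        mu = weights @ samples
    return mu


def toy_reward(v):
    """Hypothetical reward: negative squared distance to a fixed target."""
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((v - target) ** 2)


mu_hat = adaptive_importance_sampling(toy_reward, dim=3)
print("estimated mean:", np.round(mu_hat, 2))
print("reward at estimate:", round(float(toy_reward(mu_hat)), 3))
```

The sketch leaves out everything specific to text generation, such as the token-level trajectory and the additional hyperparameters the reviewers mention; it only shows how sampled rewards alone can steer the mean of the perturbation.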

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper is generally well written. The proposed AISP approach operates at inference time and therefore does not require training value functions, unlike RE-Control.
2. The authors provide detailed hyperparameter ablations, a KL-divergence analysis, and a thorough comparison of batched AISP with BoN.
3. The empirical evaluation is comprehensive, covering multiple base LLMs and reward models.

Weaknesses

1. The performance improvement on HH-RLHF appears incremental, and in many cases, BoN outperforms AISP. With only two datasets, it is difficult to fully assess AISP’s empirical effectiveness. I recommend evaluating on additional datasets to more clearly demonstrate the gains.
2. [Minor] While the paper includes strong baselines, adding comparisons with the controlled decoding literature [1, 2] would further strengthen the experimental section.

[1] Mudgal, S., Lee, J., Ganapathy, H., Li, Y., Wa

Reviewer 02·Rating 4·Confidence 4

Strengths

1. This paper maps decoding-time reward maximization to sampling-based optimal control in pre-logit space; the derivation via a free-energy lower bound is standard but cleanly presented.
2. The method is training-free and lightweight, and the adaptive importance sampling loop is easy to implement.
3. Empirical results show consistent reward and win-rate improvements over BoN with the same total number of samples.

Weaknesses

1. The control-theoretic view and MPPI-style derivation are known; the main step is moving importance sampling to pre-logit trajectories with a Gaussian prior. The reduction to BoN for λ→0 suggests to me that AISP is a structured generalization of BoN rather than a new paradigm.
2. Tasks are preference datasets (SHP, HH-RLHF) with reward-model scoring; diversity/coherence sometimes degrade, and win-rate uses small paired samples. No tests on reasoning/code/math where long-horizon dynamics might stre
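The λ→0 limit mentioned in this review can be made concrete with a short derivation. It is a sketch under the assumption that the importance weights take an MPPI-style softmax form over $n$ sampled rewards; the symbols $w_i$, $r_i$, $x_i$, and $i^\ast$ are illustrative and not taken from the paper.

```latex
% Assumed softmax weights over sampled rewards r_1, ..., r_n with temperature \lambda > 0:
w_i(\lambda)
  = \frac{\exp(r_i/\lambda)}{\sum_{j=1}^{n} \exp(r_j/\lambda)}
  = \frac{1}{1 + \sum_{j \neq i} \exp\!\big((r_j - r_i)/\lambda\big)} .

% Let i^\ast be the unique index of the largest reward.
% For i = i^\ast, every exponent is negative, so each term vanishes as \lambda \to 0^+.
% For i \neq i^\ast, the term j = i^\ast has a positive exponent and diverges.
\lim_{\lambda \to 0^+} w_i(\lambda) =
  \begin{cases}
    1, & i = i^\ast, \\
    0, & i \neq i^\ast,
  \end{cases}
\qquad\Longrightarrow\qquad
\lim_{\lambda \to 0^+} \sum_{i=1}^{n} w_i(\lambda)\, x_i = x_{i^\ast} .
```

Under this assumption the weighted average collapses onto the single highest-reward sample $x_{i^\ast}$, which is the selection rule of best-of-n; a larger λ spreads the weight over several samples instead.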

Reviewer 03·Rating 6·Confidence 4

Strengths

- AISP eliminates the need for training and data collection in test-time alignment.
- The integration of adaptive importance sampling with model predictive path integral (MPPI) control is novel and well-motivated.
- The analysis of modeling pre-logits $z$ as Gaussian distributions and its connection to Best-of-N (BoN) is insightful.

Weaknesses

- AISP introduces numerous hyperparameters, including the standard deviation $\sigma$, the softmax temperature $\lambda$, MPPI coefficient $\alpha$, number of iterations $k$, and window size $\tau$. This complexity limits the practicality of AISP. Moreover, the paper does not sufficiently analyze the sensitivity of these hyperparameters across different tasks and models, or their interactions with standard generation parameters such as temperature, top-$p$, and top-$k$ sampling.
- Although AISP
