Token-Level Inference-Time Alignment for Vision-Language Models

Kejia Chen; Jiawen Zhang; Jiacong Hu; Kewei Gao; Jian Lou; Zunlei Feng; Mingli Song

arXiv:2510.21794·cs.CV·October 28, 2025

TL;DR

This paper introduces TITA, a lightweight inference-time alignment method for vision-language models that improves accuracy and reduces hallucinations without retraining the backbone, by using token-level feedback derived from a reward model.

Contribution

TITA provides a novel token-level inference-time alignment framework that enhances VLM performance and reduces hallucinations without expensive fine-tuning.

Findings

01

Achieves an 8.6% improvement on the MMVet benchmark.

02

Reduces hallucinations and improves VQA accuracy.

03

Incurs negligible inference overhead.

Abstract

Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without…
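
The log-probability-ratio signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated and does not give the exact decoding rule, so the combination used here (base log-probability plus a scaled log-ratio) is an assumption, as are the function names and the scaling parameter `lam` (the reviews mention a scaling factor lambda, but not how it enters the formula). The logits are toy numbers over a three-token vocabulary, not model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_log_probs(base_logits, reward_logits, lam=1.0):
    """Re-weight the frozen base model's next-token distribution.

    ASSUMPTION: the per-token implicit reward is the log-probability ratio
    log p_reward(token) - log p_base(token), added to the base
    log-probability with weight `lam`. The paper's exact rule may differ.
    """
    base_lp = log_softmax(base_logits)
    reward_lp = log_softmax(reward_logits)
    scores = [b + lam * (r - b) for b, r in zip(base_lp, reward_lp)]
    return log_softmax(scores)  # renormalise to a valid distribution


def greedy_step(base_logits, reward_logits, lam=1.0):
    """Pick the next token id under the guided distribution."""
    lp = guided_log_probs(base_logits, reward_logits, lam)
    return max(range(len(lp)), key=lp.__getitem__)


# Toy example: the base model prefers token 0, the reward model token 1.
base = [2.0, 1.0, 0.0]
reward = [0.0, 3.0, 0.0]
unguided = greedy_step(base, reward, lam=0.0)
guided = greedy_step(base, reward, lam=1.0)
```

Under this assumed rule, `lam=0.0` recovers the base model's distribution unchanged, and larger values shift each decoding step toward tokens the reward model favours relative to the base model. The feedback is applied per token during generation, so no re-ranking of complete sequences is needed.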

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

- Clear and Well-Structured: The paper is well-organized, with detailed explanations of the preliminary, intuition, and methodology.
- Superiority in Alignment: The experimental results demonstrate that the proposed method achieves the overall best performance on the general VQA and hallucination benchmarks compared to the baselines.

Weaknesses

- The backbones used in the experiments are somewhat outdated, particularly since the main results presented in Table 2 are based on the LLaVA 1.5 series models. While I acknowledge that the authors also provide results using Qwen-2.5-VL and DeepSeek-VL2, a more comprehensive evaluation using such recent and stronger VLMs would strengthen the manuscript.
- Because VLM alignment is a highly competitive and rapidly evolving research area, the paper should provide evaluation against up-to-date methods and backbones

Reviewer 02·Rating 4·Confidence 4

Strengths

(1) TITA innovatively transforms sequence-level rewards into token-level signals, addressing the issues of feedback delay and high computational cost in existing methods. By directly guiding the decoding process without the need for sequence re-ranking, it enables timely intervention against hallucinations with extremely low training cost.
(2) TITA is a plug-and-play method that does not modify the parameters of the base model, giving it strong generality and allowing it to be flexibly applied

Weaknesses

(1) TITA relies on image augmentation and response fusion to generate the “winning” responses. This mechanism primarily captures the comprehensiveness of visual elements, which may make it difficult to learn deeper semantic or complex reasoning errors that cause hallucinations in VLMs. As a result, the reward model may be limited in capturing more sophisticated preference patterns.
(2) The proposed method is highly sensitive to the scaling factor lambda. As shown in Figure 3 of the paper, the p

Reviewer 03·Rating 6·Confidence 2

Strengths

1. The idea of bringing Direct Preference Optimization into inference-time, at the token level, is both conceptually neat and practical. It bridges the gap between coarse sequence-level feedback and expensive retraining.
2. The experiments are broad (12 benchmarks, several VLM families) and show clear, consistent gains in hallucination suppression and visual reasoning accuracy with very low additional cost.
3. The figures and algorithm explanations are intuitive; the comparisons with prior traini

Weaknesses

1. It would be valuable to analyze how the reward model’s scale or quality influences performance, for example by comparing smaller versus larger reward models to verify robustness of token-level alignment.
2. While the paper shows cross-model adaptability (7B to 27B), it would be insightful to analyze how reward model quality affects alignment. For instance, does using a smaller or noisier reward model degrade token-level signals significantly?
