Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas; Pierluca D'Oro; Koustuv Sinha; Adriana Romero-Soriano; Michal Drozdzal; Aishwarya Agrawal

arXiv:2508.11616·cs.CV·August 18, 2025

TL;DR

This paper introduces a reward-guided decoding method for Multimodal Large Language Models (MLLMs). It gives users inference-time control over visual grounding, reduces object hallucinations, and outperforms existing hallucination mitigation methods on standard benchmarks.

Contribution

The paper presents the first reward-guided decoding approach for MLLMs, giving inference-time control over object precision, object recall, and the computational cost of decoding.

Findings

01 Enhanced controllability over MLLM inference.

02 Significant reduction in object hallucinations.

03 Outperforms existing hallucination mitigation methods.

Abstract

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image…
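To make the idea of weighting two reward functions during decoding concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the linear combination with weight `w`, the candidate count `k`, and the toy word-matching rewards are all assumptions made for illustration. In the paper the rewards come from learned reward models and the candidates from an MLLM.

```python
from typing import Callable, List

RewardFn = Callable[[str], float]
ProposeFn = Callable[[str, int], List[str]]


def combined_reward(text: str, r_precision: RewardFn, r_recall: RewardFn, w: float) -> float:
    """Assumed linear mix: w=1 scores precision only, w=0 scores recall only."""
    return w * r_precision(text) + (1.0 - w) * r_recall(text)


def reward_guided_decode(propose: ProposeFn, r_precision: RewardFn, r_recall: RewardFn,
                         w: float = 0.5, k: int = 4, max_steps: int = 1) -> str:
    """At each step, score k candidate continuations and keep the best one."""
    prefix = ""
    for _ in range(max_steps):
        candidates = propose(prefix, k)
        if not candidates:
            break
        prefix = max(candidates, key=lambda c: combined_reward(c, r_precision, r_recall, w))
    return prefix


# --- Toy stand-ins (hypothetical; a real system would use an MLLM and learned rewards) ---
IMAGE_OBJECTS = {"dog", "ball"}               # objects actually present in the image
VOCAB = {"dog", "ball", "cat", "frisbee"}     # objects the toy captioner can mention


def mentioned(text: str) -> set:
    """Object words from VOCAB that appear in the text."""
    return {word.strip(".,") for word in text.lower().split()} & VOCAB


def toy_precision(text: str) -> float:
    """Fraction of mentioned objects that are really in the image."""
    m = mentioned(text)
    return len(m & IMAGE_OBJECTS) / len(m) if m else 1.0


def toy_recall(text: str) -> float:
    """Fraction of image objects that the text mentions."""
    return len(mentioned(text) & IMAGE_OBJECTS) / len(IMAGE_OBJECTS)


def toy_propose(prefix: str, k: int) -> List[str]:
    """Return up to k fixed continuations appended to the prefix."""
    continuations = ["A dog.", "A dog, a ball, a cat.", "A cat.", "A frisbee."]
    return [(prefix + " " + c).strip() for c in continuations[:k]]


precise_caption = reward_guided_decode(toy_propose, toy_precision, toy_recall, w=1.0)
thorough_caption = reward_guided_decode(toy_propose, toy_precision, toy_recall, w=0.0)
```

With `w=1.0` the toy decoder picks the caption that mentions only objects present in the image, while `w=0.0` picks the caption that covers every image object at the cost of one hallucinated object. Raising or lowering the hypothetical `k` changes how many candidates are scored per step, which is one simple way to trade compute for search breadth.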
