Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal

TL;DR
This paper introduces a novel reward-guided decoding method for Multimodal Large Language Models, enabling dynamic control over visual grounding and hallucination mitigation during inference, with improved performance on standard benchmarks.
Contribution
The paper presents the first reward-guided decoding approach for MLLMs, allowing inference-time control over object precision and recall in visual grounding, as well as over computational trade-offs.
Findings
Enhanced controllability over MLLM inference.
Significant reduction in object hallucinations.
Outperforms existing hallucination mitigation methods.
Abstract
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image…
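A minimal sketch of the first kind of control described in the abstract, where a user-chosen weight sets the relative importance of the two rewards. The abstract does not give the scoring rule, so the linear combination, the function name select_candidate, the weight w, and the toy word-matching rewards below are all illustrative assumptions rather than the authors' implementation. In the paper the rewards come from learned reward models that score outputs against the image.

```python
from typing import Callable, Sequence

Reward = Callable[[str], float]


def select_candidate(
    candidates: Sequence[str],
    precision_reward: Reward,
    recall_reward: Reward,
    w: float,
) -> str:
    """Return the candidate with the highest weighted reward.

    Assumption: the two rewards are mixed linearly, with w in [0, 1]
    weighting precision and (1 - w) weighting recall.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if not candidates:
        raise ValueError("candidates must be non-empty")

    def combined(text: str) -> float:
        return w * precision_reward(text) + (1.0 - w) * recall_reward(text)

    return max(candidates, key=combined)


# Toy stand-ins for the learned reward models (hypothetical): the objects
# in the image are assumed known, and mentions are found by word matching.
IMAGE_OBJECTS = {"dog", "ball", "tree"}
VOCAB = {"dog", "ball", "tree", "cat"}


def mentioned(text: str) -> set:
    tokens = text.lower().replace(",", " ").split()
    return {tok for tok in tokens if tok in VOCAB}


def toy_precision(text: str) -> float:
    objs = mentioned(text)
    return len(objs & IMAGE_OBJECTS) / len(objs) if objs else 0.0


def toy_recall(text: str) -> float:
    return len(mentioned(text) & IMAGE_OBJECTS) / len(IMAGE_OBJECTS)


candidates = ["a dog", "a dog, a ball, a tree and a cat"]

# w = 1.0 favors precision, so the short caption with no hallucinated object wins.
precise = select_candidate(candidates, toy_precision, toy_recall, w=1.0)
# w = 0.0 favors recall, so the longer caption covering all objects wins.
thorough = select_candidate(candidates, toy_precision, toy_recall, w=0.0)
```

Under these assumptions, raising w pushes the selection toward captions that mention fewer but safer objects, and lowering it pushes toward captions that cover more of the image at the risk of hallucinated objects.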
