Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa

TL;DR
This paper introduces rationale-enhanced decoding (RED), an inference-time decoding strategy that improves the accuracy and faithfulness of multi-modal reasoning in large vision-language models (LVLMs) by combining visual and rationale information during generation.
Contribution
The paper proposes RED, a plug-and-play decoding method that improves multi-modal chain-of-thought (CoT) reasoning by harmonizing visual and rationale information at inference time, with no additional training.
Findings
RED consistently outperforms standard CoT and other decoding methods.
RED improves reasoning accuracy across multiple benchmarks and LVLMs.
The approach enhances the faithfulness and reliability of rationale-grounded reasoning.
Abstract
Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct…
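The abstract is cut off before it says exactly which distributions RED multiplies, so the following is only a minimal sketch of the general idea, not the authors' implementation. It relies on a standard result: maximizing an expected reward under a KL penalty toward a reference policy yields an optimum proportional to the reference distribution times the exponentiated reward. When the reward is a log-likelihood, that becomes a product of two next-token distributions, one raised to a power. The names `ref_logits`, `rationale_logits`, and `weight` are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def combine_next_token(ref_logits, rationale_logits, weight=1.0):
    """Return next-token probabilities proportional to p_ref * p_rat ** weight.

    ref_logits:       logits from a reference distribution (assumed here to be
                      the image-conditioned model output).
    rationale_logits: logits from a rationale-conditioned distribution
                      (also an assumption for this sketch).
    weight:           hypothetical exponent playing the role of 1 / beta in
                      the KL-constrained objective.
    """
    ref_lp = log_softmax(ref_logits)
    rat_lp = log_softmax(rationale_logits)
    # Multiplying probabilities is adding log-probabilities.
    combined = [a + weight * b for a, b in zip(ref_lp, rat_lp)]
    # Renormalize so the result is a valid distribution.
    return [math.exp(x) for x in log_softmax(combined)]


# Toy vocabulary of three tokens.
ref_logits = [2.0, 1.0, 0.0]        # reference model prefers token 0
rationale_logits = [0.0, 3.0, 0.0]  # rationale strongly supports token 1
probs = combine_next_token(ref_logits, rationale_logits, weight=1.0)
```

In this toy case the rationale-conditioned distribution shifts the most likely token from index 0 to index 1; setting `weight=0.0` recovers the reference distribution unchanged.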
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
