Debating for Better Reasoning: An Unsupervised Multimodal Approach
Ashutosh Adhikari, Mirella Lapata

TL;DR
This paper introduces a multimodal debate framework for visual question answering: two vision-language models debate an answer, and a text-only judge that never sees the image picks a winner from the arguments alone. The authors report that this improves both the performance and the reasoning of the models.
Contribution
It extends the debate paradigm to multimodal settings, enabling weaker models to supervise and enhance stronger models' reasoning capabilities.
Findings
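The debate framework outperforms individual models on multimodal tasks.
Judgments from weaker LLMs can improve the reasoning of vision-language models.
Experts defend only answers aligned with their own beliefs, which removes the need for explicit role-playing and concentrates debate on cases where the experts disagree.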
Abstract
As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks…
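The protocol described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the name `run_debate`, the `Expert` and `Judge` callable signatures, the transcript format, and the `rounds` parameter are all assumptions introduced here. The sketch also assumes each expert keeps defending its opening answer in later rounds. What it does take from the abstract is that the experts see the image, the judge receives only text, and a debate happens only when the experts disagree.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# an expert sees the image and returns (answer, argument);
# a judge sees only the question and the text transcript.
Expert = Callable[[str, str, List[str]], Tuple[str, str]]
Judge = Callable[[str, List[str]], str]


def run_debate(
    image_ref: str,
    question: str,
    expert_a: Expert,
    expert_b: Expert,
    judge: Judge,
    rounds: int = 2,
) -> str:
    """Return the final answer for one visual question."""
    # Each "sighted" expert states the answer it actually believes.
    answer_a, arg_a = expert_a(image_ref, question, [])
    answer_b, arg_b = expert_b(image_ref, question, [])

    # No disagreement means no debate: accept the shared answer.
    if answer_a == answer_b:
        return answer_a

    transcript = [
        f"Expert A defends '{answer_a}': {arg_a}",
        f"Expert B defends '{answer_b}': {arg_b}",
    ]

    # Further rounds: both experts respond to the transcript so far.
    # Assumption: each expert keeps its opening answer, so the
    # answer returned in later rounds is ignored.
    for _ in range(rounds - 1):
        _, arg_a = expert_a(image_ref, question, list(transcript))
        _, arg_b = expert_b(image_ref, question, list(transcript))
        transcript.append(f"Expert A defends '{answer_a}': {arg_a}")
        transcript.append(f"Expert B defends '{answer_b}': {arg_b}")

    # The "blind" judge never receives image_ref, only text.
    return judge(question, transcript)
```

In practice `expert_a` and `expert_b` would wrap calls to vision-language models and `judge` would wrap a text-only LLM; plain functions stand in for them here so the control flow can be exercised without any model.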
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Focus
