AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao, Tianfei Zhou

TL;DR
AbductiveMLLM enhances visual abductive reasoning in multimodal large language models by integrating verbal hypothesis pruning and pictorial scene imagining, leading to state-of-the-art results on VAR benchmarks.
Contribution
This paper introduces a novel dual-mode framework with REASONER and IMAGINER components to improve abductive inference in MLLMs, inspired by human cognition.
Findings
Achieves state-of-the-art performance on VAR benchmarks.
Outperforms traditional solutions and advanced MLLMs.
Demonstrates effective integration of verbal and pictorial reasoning.
Abstract
Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs develop strong general-purpose multimodal reasoning capabilities, they fall short in abductive inference, as compared to human beings. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM comprising of two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
