Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee

TL;DR
This paper proposes a self-captioning approach with a Multimodal Interaction Gate to amplify shared information between modalities, reducing hallucinations and improving robustness in vision-language models.
Contribution
It introduces a novel self-captioning workflow that converts unique modality interactions into redundant ones, enhancing model reliability.
Findings
Redundancy amplification reduces visual hallucinations by 38.3%.
Model consistency improves by 16.8% with increased redundancy.
The Multimodal Interaction Gate effectively converts unique interactions into shared information.
Abstract
Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions (the redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities) to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that…
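The three interaction types named in the abstract can be illustrated on toy data. The sketch below is not the paper's estimator or its gate. It assumes two binary "modalities" x1 and x2 and a label y, and it assumes the minimum-mutual-information (MMI) definition of redundancy, where the redundant part is the smaller of the two single-modality mutual informations. The function names `mutual_information` and `decompose` and the three toy distributions are hypothetical, introduced only for illustration.

```python
import math
from collections import defaultdict


def mutual_information(joint, source_idx):
    """Return I(S; Y) in bits.

    joint maps (x1, x2, y) -> probability; S is the tuple of source
    coordinates selected by source_idx, e.g. (0,) for x1 or (0, 1) for both.
    """
    p_sy = defaultdict(float)
    p_s = defaultdict(float)
    p_y = defaultdict(float)
    for (x1, x2, y), p in joint.items():
        s = tuple((x1, x2)[i] for i in source_idx)
        p_sy[(s, y)] += p
        p_s[s] += p
        p_y[y] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_y[y]))
        for (s, y), p in p_sy.items()
        if p > 0
    )


def decompose(joint):
    """Split I(x1, x2; y) into redundant, unique, and synergistic parts.

    Assumption: redundancy is min(I(x1; y), I(x2; y)) (the MMI measure).
    Other redundancy measures give different splits.
    """
    i1 = mutual_information(joint, (0,))
    i2 = mutual_information(joint, (1,))
    i12 = mutual_information(joint, (0, 1))
    redundant = min(i1, i2)
    return {
        "redundant": redundant,
        "unique_1": i1 - redundant,
        "unique_2": i2 - redundant,
        "synergistic": i12 - i1 - i2 + redundant,
    }


# Both modalities carry the label: purely redundant information.
copy = {(b, b, b): 0.5 for b in (0, 1)}
# Only x1 carries the label, x2 is independent noise: purely unique to x1.
only_x1 = {(a, b, a): 0.25 for a in (0, 1) for b in (0, 1)}
# The label is x1 XOR x2: neither modality alone helps, purely synergistic.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

for name, dist in [("copy", copy), ("only_x1", only_x1), ("xor", xor)]:
    print(name, decompose(dist))
```

In this toy picture, moving from the `only_x1` case toward the `copy` case is what it means for unique information to become redundant: if one modality is then corrupted, the other still carries the label. How the paper's gate achieves this on real image-text data is not specified in the truncated abstract.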
