Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning
Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu

TL;DR
V3Fusion introduces a novel ensemble method for vision-language models that uses focal error diversity and a genetic algorithm to select and fuse models, significantly improving visual reasoning accuracy and robustness.
Contribution
The paper proposes a new fusion approach combining focal error diversity and CKA-based metrics with genetic algorithm pruning for effective VLM ensemble selection.
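The genetic-algorithm pruning step can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the paper's fitness function, encoding, or hyperparameters, so `select_ensemble`, `repair`, the caller-supplied `fitness` callable, and all default values below are hypothetical. Each candidate ensemble is encoded as a bitmask over the model pool, and the full pool is seeded into the first generation so that, with elitism, the returned subset never scores below it.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = model kept in the ensemble, 0 = pruned


def repair(mask: Mask, rng: random.Random, min_size: int = 2) -> Mask:
    """Switch on random bits until at least `min_size` models are selected."""
    bits = list(mask)
    off = [i for i, b in enumerate(bits) if b == 0]
    rng.shuffle(off)
    while sum(bits) < min_size and off:
        bits[off.pop()] = 1
    return tuple(bits)


def select_ensemble(
    n_models: int,
    fitness: Callable[[Mask], float],  # hypothetical score, e.g. a diversity metric
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> List[int]:
    """Return indices of the models kept by a simple elitist genetic algorithm.

    Assumes n_models >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    # Seed the population with the full pool, then fill with random subsets.
    population = [tuple([1] * n_models)]
    while len(population) < pop_size:
        candidate = tuple(rng.randint(0, 1) for _ in range(n_models))
        population.append(repair(candidate, rng))
    best = max(population, key=fitness)

    for _ in range(generations):
        children = [best]  # elitism: the best mask so far always survives
        while len(children) < pop_size:
            # Tournament selection: the fitter of two random individuals.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n_models - 1)
            child = p1[:cut] + p2[cut:]
            child = tuple(b ^ 1 if rng.random() < mutation_rate else b for b in child)
            children.append(repair(child, rng))
        population = children
        best = max(population, key=fitness)

    return [i for i, bit in enumerate(best) if bit]
```

In practice the `fitness` callable would wrap whatever ensemble score is being optimized (for example, a diversity metric combined with validation accuracy); a fixed `seed` makes the search reproducible.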
Findings
Outperforms individual VLMs on multiple benchmarks by up to 8.09% in accuracy.
Effectively mitigates hallucinations and captures epistemic uncertainty.
Demonstrates robustness even without majority consensus.
Abstract
With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations.…
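To make the idea of measuring disagreement between visual embeddings concrete, here is a minimal sketch of standard linear Centered Kernel Alignment (CKA). The exact definition of the paper's CKA-focal metric is not given in this summary, so this shows only plain linear CKA; the function names and the choice of `1 - CKA` as a disagreement score are assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices of shape (n_samples, dim).

    The two matrices must describe the same samples in the same order,
    but their feature dimensions may differ.
    """
    # Center each feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_disagreement(x: np.ndarray, y: np.ndarray) -> float:
    """Hypothetical disagreement score: 0 for identical representations."""
    return 1.0 - linear_cka(x, y)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either embedding space, which makes it usable for comparing encoders with different architectures and embedding widths.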
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
