Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Selim Furkan Tekin; Yichang Xu; Gaowen Liu; Ramana Rao Kompella; Margaret L. Loper; Ling Liu

arXiv:2603.12669·cs.CV·March 16, 2026

Open Access

TL;DR

V3Fusion is an ensemble method for vision-language models (VLMs) that uses focal error diversity and a genetic algorithm to select and fuse models, improving visual reasoning accuracy by up to 8.09% over individual VLMs and making the fused output more robust.

Contribution

The paper proposes a new fusion approach combining focal error diversity and CKA-based metrics with genetic algorithm pruning for effective VLM ensemble selection.
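
As a rough illustration of genetic-algorithm pruning over a model pool, the sketch below searches binary masks that switch candidate models on or off. The fitness used here (plurality-vote accuracy on a labeled validation set minus a small size penalty) is an assumption made for the example, not the paper's focal-diversity objective, and the names `vote_accuracy`, `ga_select`, and every parameter are hypothetical.

```python
import random
from collections import Counter


def vote_accuracy(mask, preds, labels):
    """Plurality-vote accuracy of the models switched on in `mask`."""
    chosen = [p for keep, p in zip(mask, preds) if keep]
    if not chosen:
        return 0.0
    hits = 0
    for i, label in enumerate(labels):
        votes = Counter(p[i] for p in chosen)
        # Ties resolve to the answer that was counted first.
        hits += votes.most_common(1)[0][0] == label
    return hits / len(labels)


def ga_select(preds, labels, pop_size=20, generations=30,
              mutation_rate=0.1, size_penalty=0.01, seed=0):
    """Search for a model subset with a simple elitist genetic algorithm.

    `preds[m][i]` is model m's answer on validation item i.
    Assumes at least two candidate models.
    """
    rng = random.Random(seed)
    n = len(preds)

    def fitness(mask):
        # Stand-in objective: accuracy minus a penalty per selected model.
        return vote_accuracy(mask, preds, labels) - size_penalty * sum(mask)

    population = [tuple(rng.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)        # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)
```

Calling `ga_select(preds, labels)` returns the best mask found and its fitness. The size penalty is what pushes the search to drop models that add nothing to the vote; a diversity-aware score could be substituted for `fitness` without changing the rest of the loop.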

Findings

01 Outperforms individual VLMs on multiple benchmarks by up to 8.09% accuracy.

02 Effectively mitigates hallucinations and captures epistemic uncertainty.

03 Demonstrates robustness even without majority consensus.

Abstract

With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both the vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to prune the component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations.…
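
To make the idea of measuring disagreement between visual embeddings concrete, here is a minimal sketch of standard linear CKA (centered kernel alignment) between two embedding matrices computed on the same inputs. It shows only plain linear CKA; how the paper turns CKA into its CKA-focal metric is not given in the abstract, so `cka_disagreement` (one minus the similarity) is a hypothetical stand-in, and both function names are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between embeddings x (n, d1) and y (n, d2) of the same n inputs."""
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_disagreement(x, y):
    """Hypothetical disagreement score: 0 when the representations align fully."""
    return 1.0 - linear_cka(x, y)
```

Linear CKA lies in [0, 1] and is invariant to orthogonal transforms and isotropic scaling of either embedding, so two encoders with different embedding widths can be compared directly as long as the rows correspond to the same images.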

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning