TL;DR
SCoOP is a training-free framework that quantifies uncertainty in systems combining multiple Vision-Language Models (VLMs), improving hallucination detection and abstention with minimal overhead.
Contribution
It introduces a novel system-level uncertainty quantification method for multi-VLM systems, enabling effective hallucination detection and abstention without additional training.
Findings
SCoOP achieves an AUROC of 0.866 for hallucination detection on ScienceQA, outperforming baselines.
It attains an AURAC of 0.907 for abstention, surpassing existing methods.
Aggregation adds only microsecond-level overhead, which is negligible compared to model inference time.
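For readers unfamiliar with the AUROC metric reported above, the following is a minimal sketch of how it can be computed for hallucination detection. It is not taken from the paper: the function name `auroc`, the toy scores, and the toy labels are all illustrative assumptions. AUROC here is the probability that a randomly chosen hallucinated answer receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half.

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC.

    scores: uncertainty score per sample (higher = more uncertain).
    labels: 1 if the answer was a hallucination, 0 if it was correct.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Count pairs where the hallucinated sample is ranked above the correct one.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only (not results from the paper).
toy_scores = [0.9, 0.4, 0.3, 0.2, 0.6]
toy_labels = [1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 5 of 6 pairs ranked correctly
```

A value of 0.5 corresponds to a score that ranks hallucinations no better than chance, and 1.0 to a perfect ranking.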
Abstract
Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert," sample multiple outputs, map them to a unified space, aggregate their opinions, and produce a system-level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection,…
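The pipeline described in the abstract (sample outputs per model, map them to a shared space, pool the opinions, score system-level uncertainty) can be illustrated with a small sketch. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the weighting formula, so the weights below (one minus normalized entropy) are an assumption, as are all function names, the multiple-choice option set, and the sampled answers.

```python
import math
from collections import Counter


def answer_distribution(samples, options):
    """Map one model's sampled answers to an empirical distribution over shared options."""
    counts = Counter(samples)
    return [counts[o] / len(samples) for o in options]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def pool_opinions(dists):
    """Linear opinion pool with uncertainty-based weights.

    ASSUMPTION: each model's weight is proportional to
    1 - (its entropy / max entropy); the paper may use a different rule.
    """
    k = len(dists[0])
    max_h = math.log(k)  # requires at least two options
    raw = [1.0 - entropy(p) / max_h + 1e-6 for p in dists]  # epsilon avoids all-zero weights
    total = sum(raw)
    weights = [r / total for r in raw]
    pooled = [sum(w * p[j] for w, p in zip(weights, dists)) for j in range(k)]
    return pooled, weights


def system_uncertainty(pooled):
    """Normalized entropy of the pooled distribution, in [0, 1]."""
    return entropy(pooled) / math.log(len(pooled))


# Hypothetical sampled answers from three models on one multiple-choice question.
options = ["A", "B", "C", "D"]
samples_per_model = [
    ["A", "A", "A", "A", "A"],
    ["A", "A", "B", "A", "C"],
    ["B", "A", "A", "D", "A"],
]
dists = [answer_distribution(s, options) for s in samples_per_model]
pooled, weights = pool_opinions(dists)
uncertainty = system_uncertainty(pooled)
answer = options[pooled.index(max(pooled))]
abstain = uncertainty > 0.8  # threshold chosen arbitrarily for illustration
```

In this sketch the model whose samples all agree receives the largest weight, and the system abstains only when the pooled distribution is close to uniform. The pooling step is a handful of arithmetic operations per option, which is consistent with the claim that aggregation cost is small relative to model inference.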
