TL;DR
SIEVES introduces a visual evidence scoring method for selective prediction in multimodal large language models, significantly improving out-of-distribution coverage and enabling transfer to proprietary reasoners without internal confidence signals.
Contribution
The paper proposes a selector that estimates localization quality using only model inputs and outputs, which improves generalization and transferability in visual question answering.
Findings
Improves coverage by up to 3x on OOD benchmarks.
Enables transfer to proprietary reasoners without access to internal signals.
Generalizes across multiple benchmarks and reasoner models.
Abstract
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Selective prediction aims to improve coverage, i.e., the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on answers whose score falls below a chosen threshold. Existing selective prediction methods estimate implicit confidence scores from model-internal signals such as logits or hidden representations, which are not available for frontier closed-source models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence…
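The thresholding scheme the abstract describes can be illustrated with a minimal Python sketch. It is a generic selective-prediction evaluation, not the SIEVES selector or the paper's evaluation code; the function names `selective_answer` and `coverage_at_risk`, and the toy scores, are hypothetical and chosen only for illustration. Risk here is assumed to mean the error rate among answered inputs.

```python
from typing import Optional, Sequence


def selective_answer(answer: str, confidence: float, threshold: float) -> Optional[str]:
    """Return the answer if its confidence clears the threshold, else abstain (None)."""
    return answer if confidence >= threshold else None


def coverage_at_risk(
    confidences: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> float:
    """Largest fraction of inputs that a single confidence threshold can answer
    while keeping the error rate among answered inputs at or below max_risk.
    (Assumed definition of risk; hypothetical helper, not from the paper.)"""
    n = len(confidences)
    if n == 0:
        return 0.0
    # Visit examples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    best = 0.0
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # A threshold cannot split tied scores, so only evaluate at tie boundaries.
        at_boundary = k == n or confidences[order[k]] != confidences[i]
        if at_boundary and errors / k <= max_risk:
            best = k / n
    return best


# Toy data: five answers with selector scores and correctness labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
coverage = coverage_at_risk(scores, labels, max_risk=0.25)
```

In this toy example, answering the four most confident inputs yields one error in four, so a 25% risk budget permits 80% coverage. A better selector ranks correct answers higher, which raises coverage at the same risk level.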
