Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Junzhang Liu; Zhecan Wang; Hammad Ayyubi; Haoxuan You; Chris Thomas; Rui Sun; Shih-Fu Chang; Kai-Wei Chang

arXiv:2405.11145·cs.CV·April 1, 2025

TL;DR

This paper introduces a method for vision-language tasks that detects when the provided context is insufficient and abstains from answering, improving model trustworthiness and reducing hallucinations.

Contribution

It proposes a novel context-aware abstention detector (CARA) and a context selection module to enhance evidence-based predictions in vision-language understanding benchmarks.

Findings

01 CARA generalizes well to unseen benchmarks

02 Significant accuracy improvements with context-aware abstention

03 Curated CASE set for benchmarking insufficient context detection

Abstract

Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data fosters biased learning and hallucinations, as models tend to make similar unwarranted assumptions. To address this issue, we collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. Strong improvements across multiple benchmarks demonstrate the effectiveness of our approach. Further, we develop a general-purpose Context-AwaRe Abstention (CARA) detector to identify samples lacking sufficient context and enhance model accuracy by abstaining from responding if the…
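The abstention idea in the abstract can be illustrated with a minimal sketch of selective prediction. This is not the paper's CARA implementation: the `Sample` type, the `sufficiency` score, and the fixed `threshold` are all hypothetical stand-ins for whatever a trained detector would output. The sketch only shows how abstaining on low-scoring samples trades coverage for accuracy on the samples that are answered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    prediction: str     # answer from some VLU model (hypothetical)
    gold: str           # reference answer
    sufficiency: float  # hypothetical detector score in [0, 1]; higher = more context


def answer_or_abstain(sample: Sample, threshold: float = 0.5) -> Optional[str]:
    """Return the prediction, or None (abstain) when the score is below the threshold."""
    return sample.prediction if sample.sufficiency >= threshold else None


def selective_accuracy(samples: List[Sample], threshold: float = 0.5) -> Tuple[float, float]:
    """Return (accuracy over answered samples, coverage = fraction answered)."""
    answered = [s for s in samples if answer_or_abstain(s, threshold) is not None]
    if not answered:
        return 0.0, 0.0
    correct = sum(s.prediction == s.gold for s in answered)
    return correct / len(answered), len(answered) / len(samples)


# Toy data, invented for illustration only.
samples = [
    Sample("cat", "cat", 0.9),
    Sample("dog", "cat", 0.2),   # low score: the sketch abstains here
    Sample("red", "red", 0.7),
    Sample("two", "three", 0.6),
]
accuracy, coverage = selective_accuracy(samples, threshold=0.5)
```

In this toy setup, raising `threshold` reduces coverage and, if the score is informative, raises accuracy on the remaining samples; how the score is actually produced is the part the paper's detector addresses.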

Taxonomy

Topics: Speech and dialogue systems

Methods: Sparse Evolutionary Training