HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models

Sushant Gautam; Michael A. Riegler; Pål Halvorsen

arXiv:2511.12693 · cs.CV · November 18, 2025

TL;DR

HEDGE introduces a unified, geometry-based framework for detecting hallucinations in vision-language models, leveraging perturbations, clustering, and uncertainty metrics to improve reliability assessment across architectures.

Contribution

The paper presents HEDGE, a novel, reproducible pipeline combining visual perturbations, semantic clustering, and robust metrics for hallucination detection in multimodal models.

Findings

01

Models with dense visual tokenization show higher hallucination detectability.

02

Embedding-based clustering often outperforms NLI-based clustering for answer separation.

03

The VASE metric provides consistent hallucination signals across configurations.
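Finding 02 contrasts embedding-based and NLI-based clustering of sampled answers. The sketch below shows one simple form of embedding-based clustering and is not the paper's implementation. It assumes each answer has already been mapped to a vector by some sentence encoder. The function names, the greedy assignment rule, the toy vectors, and the 0.9 cosine threshold are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_cluster(embeddings: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Hypothetical greedy scheme: an answer joins the first cluster whose
    representative (its first member) has cosine similarity >= threshold;
    otherwise it starts a new cluster. Returns one cluster id per answer."""
    reps: List[Sequence[float]] = []
    labels: List[int] = []
    for emb in embeddings:
        for cid, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                labels.append(cid)
                break
        else:  # no existing cluster was similar enough
            reps.append(emb)
            labels.append(len(reps) - 1)
    return labels


# Toy 2-D "embeddings": three near-identical answers and one outlier.
answers = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0]]
print(greedy_cluster(answers))  # [0, 0, 0, 1]
```

An NLI-based variant would replace the cosine test with a bidirectional entailment check between answer texts. The embedding route needs only one encoder pass per answer, whereas NLI needs a model call per answer pair.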

Abstract

Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the…
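The abstract names metric computation as the last pipeline stage. As a hedged illustration only, one common uncertainty score is the frequency-based (discrete) semantic entropy: the Shannon entropy of sampled answers over their semantic clusters. This is a generic estimate and not necessarily HEDGE's exact metric, and VASE is not reproduced here. The function name and the cluster labels for "clean" and "perturbed" inputs are hypothetical, not data from the paper. Higher entropy means the sampled answers are spread across more clusters, which suggests a less stable answer.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of sampled
    answers over semantic clusters, one label per sampled answer."""
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_labels)
    # Written as p * log(1/p) so a single-cluster result is 0.0, not -0.0.
    return sum((c / n) * math.log(n / c) for c in counts.values())


# Hypothetical cluster labels for answers sampled on a clean image and on
# perturbed copies of it (illustrative values only).
clean = [0, 0, 0, 0]
perturbed = [0, 1, 2, 0]
print(semantic_entropy(clean))  # 0.0
print(round(semantic_entropy(perturbed), 3))  # 1.04
```

The cluster labels could come from either embedding-based or NLI-based clustering. The entropy step only counts how answers distribute across clusters.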

Taxonomy

Topics: Multimodal Machine Learning Applications · Face Recognition and Perception · Adversarial Robustness in Machine Learning