LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li

TL;DR
This paper introduces Ghost-100, a benchmark for measuring how vision-language models hallucinate as prompt phrasing becomes progressively more coercive. It finds that both hallucination behavior and sensitivity to prompt pressure differ across models.
Contribution
It presents a benchmark built around a structured five-level prompt intensity framework and dual evaluation metrics for analyzing hallucination in vision-language models.
Findings
Models show diverse hallucination responses to prompt pressure.
Some models peak in hallucination at intermediate prompt intensities.
The evaluation metrics reveal a dissociation between hallucination rate and confidence.
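The per-level analysis behind these findings can be illustrated with a minimal sketch. It assumes each judged response is reduced to a hypothetical `(level, hallucinated, score)` tuple, where `level` is the prompt intensity level (1 to 5), `hallucinated` is a boolean verdict, and `score` is some graded severity or confidence value. These field names, the function names, and the toy records are illustrative assumptions, not the paper's actual schema, metrics, or results.

```python
from collections import defaultdict

def summarize_by_level(records):
    """Aggregate judged responses per prompt level.

    records: iterable of (level, hallucinated, score) tuples (hypothetical schema).
    Returns {level: (hallucination_rate, mean_score_among_hallucinated_or_None)}.
    """
    counts = defaultdict(int)
    halluc = defaultdict(int)
    score_sum = defaultdict(float)
    for level, hallucinated, score in records:
        counts[level] += 1
        if hallucinated:
            halluc[level] += 1
            score_sum[level] += score
    summary = {}
    for level in sorted(counts):
        rate = halluc[level] / counts[level]
        # The mean score is computed only over hallucinated responses,
        # so it can move independently of the rate.
        mean_score = score_sum[level] / halluc[level] if halluc[level] else None
        summary[level] = (rate, mean_score)
    return summary

def peak_level(summary):
    """Return the prompt level with the highest hallucination rate."""
    return max(summary, key=lambda lvl: summary[lvl][0])

# Invented toy records for illustration only; not data from the paper.
toy_records = [
    (1, False, 0.0), (1, False, 0.0),
    (2, True, 0.5), (2, False, 0.0),
    (3, True, 1.0), (3, True, 0.5),
    (4, True, 1.0), (4, False, 0.0),
    (5, False, 0.0), (5, False, 0.0),
]
toy_summary = summarize_by_level(toy_records)
```

Keeping the rate and the conditional mean score as separate quantities is what lets a dissociation between the two show up: a level can have a low hallucination rate but a high mean score among the few hallucinated responses, or the reverse.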
Abstract
Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families: text-illegibility, time-reading, and object-absence, each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt…
