Steering the Verifiability of Multimodal AI Hallucinations
Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, Yu-Gang Jiang

TL;DR
This paper investigates controlling the verifiability of multimodal AI hallucinations by categorizing them into obvious and elusive types, and proposes an intervention method to regulate their detectability.
Contribution
It introduces a dataset of hallucinations categorized by verifiability and develops an activation-space intervention technique for fine-grained control.
Findings
Intervention probes differ for obvious and elusive hallucinations.
Targeted interventions improve verifiability regulation.
Mixing interventions allows flexible control for various scenarios.
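The findings mention separate intervention probes for obvious and elusive hallucinations, and mixing them for flexible control. The block below is a generic activation-steering sketch under stated assumptions, not the authors' method. It assumes each hallucination type corresponds to a direction vector in a model's hidden-state space. It mixes two such directions with a hypothetical `weight` and then shifts a hidden state along the result (h' = h + alpha * d). The names `normalize`, `mix_directions`, `steer`, `weight`, and `alpha` are illustrative only. Hooking into a real MLLM's activations is omitted, and plain Python lists stand in for tensors.

```python
from typing import Sequence


def normalize(v: Sequence[float]) -> list[float]:
    """Scale a direction vector to unit L2 norm (hypothetical helper)."""
    norm = sum(x * x for x in v) ** 0.5
    if norm == 0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in v]


def mix_directions(d_obvious: Sequence[float], d_elusive: Sequence[float],
                   weight: float) -> list[float]:
    """Convex combination of two unit-normalized steering directions.

    weight=1.0 keeps only the 'obvious' direction, weight=0.0 only the
    'elusive' one. This mixing rule is an assumption for illustration.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    a, b = normalize(d_obvious), normalize(d_elusive)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float) -> list[float]:
    """Shift a hidden-state vector along a direction: h' = h + alpha * d."""
    if len(hidden) != len(direction):
        raise ValueError("dimension mismatch")
    return [h + alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    h = [0.5, -1.0, 2.0, 0.0]        # toy hidden state
    d_obv = [1.0, 0.0, 0.0, 0.0]     # toy 'obvious' direction
    d_elu = [0.0, 1.0, 0.0, 0.0]     # toy 'elusive' direction
    d_mix = mix_directions(d_obv, d_elu, weight=0.75)
    print(steer(h, d_mix, alpha=2.0))  # [2.0, -0.5, 2.0, 0.0]
```

Here the sign and magnitude of `alpha` set the direction and strength of the shift, and `weight` trades off the two probes. Whether the paper combines its interventions this way is not stated in the summary above.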
Abstract
AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucinated content can be detected by human users (i.e., obvious hallucinations), while other content is often missed or requires more verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns…
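The abstract says hallucinations are split into obvious and elusive types based on human responses, but it does not state the exact criterion here. A minimal sketch, assuming each response records only whether a participant detected a given hallucination: compute per-item detection rates, then label an item obvious when its rate meets a threshold. The `HumanResponse` record, the `detection_rates` and `categorize` helpers, and the 0.5 threshold are all hypothetical. The sketch also ignores verification effort (e.g. time spent), which the abstract mentions as a second signal.

```python
from dataclasses import dataclass


@dataclass
class HumanResponse:
    """One participant's reaction to one hallucination (hypothetical schema)."""
    hallucination_id: str
    detected: bool  # did this participant flag the hallucinated content?


def detection_rates(responses: list[HumanResponse]) -> dict[str, float]:
    """Fraction of participants who detected each hallucination."""
    counts: dict[str, tuple[int, int]] = {}
    for r in responses:
        seen, hits = counts.get(r.hallucination_id, (0, 0))
        counts[r.hallucination_id] = (seen + 1, hits + int(r.detected))
    return {hid: hits / seen for hid, (seen, hits) in counts.items()}


def categorize(responses: list[HumanResponse],
               threshold: float = 0.5) -> dict[str, str]:
    """Label each hallucination 'obvious' or 'elusive' by detection rate.

    The default threshold is an illustrative assumption, not the paper's.
    """
    return {
        hid: "obvious" if rate >= threshold else "elusive"
        for hid, rate in detection_rates(responses).items()
    }


if __name__ == "__main__":
    responses = [
        HumanResponse("h1", True), HumanResponse("h1", True),
        HumanResponse("h1", False), HumanResponse("h2", False),
        HumanResponse("h2", True), HumanResponse("h2", False),
    ]
    print(categorize(responses))  # {'h1': 'obvious', 'h2': 'elusive'}
```

A rate-based rule keeps the split interpretable; a study that also measures verification effort might combine both signals instead of thresholding detection alone.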
