Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang; Wei-Yuan Cheng; Chi-Pin Huang; Fu-En Yang; Yu-Chiang Frank Wang

arXiv:2512.04356 · cs.CV · December 5, 2025

Open Access

TL;DR

This paper introduces SANTA, a framework that reduces object and action hallucinations in video captions generated by multimodal LLMs by suppressing spurious correlations and aligning captions with visual facts.

Contribution

The paper presents a self-augmented contrastive alignment method designed to jointly mitigate object and action hallucinations when multimodal LLMs caption videos.

Findings

01. SANTA significantly reduces hallucinations in generated video captions.

02. The method outperforms existing approaches on hallucination benchmarks.

03. Enhanced alignment improves factual consistency in multimodal descriptions.

Abstract

Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models often produce factually inaccurate descriptions, which leads to severe hallucination issues. While prior work has explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging, unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that promotes object and action faithfulness by removing spurious correlations and emphasizing visual facts. SANTA employs a hallucinative self-augmentation scheme to identify potential hallucinations latent in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a…
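The abstract is truncated here and does not state SANTA's actual training objective, so the following is not the authors' loss. It is only a minimal sketch of a generic InfoNCE-style contrastive loss, to illustrate how captions rewritten into "contrasted negatives" could in principle be pushed away from a video representation while the faithful caption is pulled toward it. The function names (`cosine`, `contrastive_loss`), the toy embedding vectors, and the `temperature` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the SANTA objective).

    video:     embedding of the input video
    positive:  embedding of the faithful caption
    negatives: embeddings of hallucinated (contrasted) captions
    """
    pos_logit = cosine(video, positive) / temperature
    logits = [pos_logit] + [cosine(video, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    # Equals -log softmax probability assigned to the positive caption.
    return log_denom - pos_logit


# Toy, made-up embeddings purely for illustration.
video = [1.0, 0.0]
faithful = [0.9, 0.1]
hallucinated = [[0.1, 0.9], [0.5, 0.5]]

loss_good = contrastive_loss(video, faithful, hallucinated)
# Swap roles: treat a hallucinated caption as the "positive".
loss_bad = contrastive_loss(video, hallucinated[0], [faithful, hallucinated[1]])
print(loss_good < loss_bad)
```

In this toy setup the loss is lower when the faithful caption is the positive than when a hallucinated caption takes its place, which is the behavior a contrastive alignment objective of this general kind is meant to reward.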

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition