ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Vipula Rawte; Sarthak Jain; Aarush Sinha; Garv Kaushik; Aman Bansal; Prathiksha Rumale Vishwanath; Samyak Rajesh Jain; Aishwarya Naresh Reganti; Vinija Jain; Aman Chadha; Amit P. Sheth; Amitava Das

arXiv:2411.10867·cs.CV·March 21, 2025

TL;DR

This paper introduces ViBe, a large-scale benchmark dataset of hallucinated videos from Text-to-Video models, to evaluate and improve the detection of AI-generated inconsistencies in video outputs.

Contribution

The paper presents a new dataset and classification framework for identifying hallucinations in T2V models, highlighting the challenge of automated hallucination detection.

Findings

01

Established baseline classification performance with TimeSFormer + CNN ensemble.

02

Identified five major hallucination types in T2V outputs.

03

Demonstrated that current detection methods reach only modest accuracy, emphasizing the need for improvement.

Abstract

Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling in generating videos from textual prompts. However, they still frequently produce hallucinated content, revealing AI-generated inconsistencies. We introduce ViBe (https://vibe-t2v-bench.github.io/): a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We…
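As a rough illustration of what classifying video embeddings into the five hallucination types could look like, here is a minimal sketch. It is not the paper's method: the listing above names a TimeSFormer + CNN ensemble as the baseline, whereas this sketch swaps in a simple nearest-centroid rule over toy two-dimensional vectors. The names Hallucination, fit_centroids, and predict, and all of the data, are hypothetical and exist only for this example.

```python
from enum import Enum
from math import sqrt


class Hallucination(Enum):
    """The five hallucination types named in the abstract."""
    VANISHING_SUBJECT = "Vanishing Subject"
    OMISSION_ERROR = "Omission Error"
    NUMERIC_VARIABILITY = "Numeric Variability"
    SUBJECT_DYSMORPHIA = "Subject Dysmorphia"
    VISUAL_INCONGRUITY = "Visual Incongruity"


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(centroids, vec):
    """Return the label whose centroid lies closest to vec."""
    return min(centroids, key=lambda label: euclidean(centroids[label], vec))


# Toy stand-ins for video embeddings (assumption: real embeddings would come
# from a video encoder and have far more dimensions than two).
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = [
    Hallucination.VANISHING_SUBJECT,
    Hallucination.VANISHING_SUBJECT,
    Hallucination.OMISSION_ERROR,
    Hallucination.OMISSION_ERROR,
]
centroids = fit_centroids(train_x, train_y)
print(predict(centroids, [0.8, 0.2]).value)
```

A nearest-centroid rule is used here only because it fits in a few lines of standard-library Python; a learned classifier over real video embeddings would replace it in practice.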

Code & Models

Datasets

ViBe-T2V-Bench/ViBe
dataset · 303 downloads

Taxonomy

Topics: Mental Health Research Topics · Mental Health and Psychiatry · Psychedelics and Drug Studies