FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation

Liqiang Jing; Viet Lai; Seunghyun Yoon; Trung Bui; Xinya Du

arXiv:2507.06523·cs.CV·July 10, 2025

TL;DR

FIFA is a unified framework for evaluating the faithfulness of VideoMLLM outputs in both video-to-text and text-to-video tasks. It targets hallucinations in open-ended, free-form responses and aligns more closely with human judgment than existing evaluation methods.

Contribution

The paper introduces FIFA, a unified framework for evaluating faithfulness in video-to-text and text-to-video generation, and proposes Post-Correction, a tool-based framework that revises hallucinated content.

Findings

1. FIFA aligns more closely with human judgment than existing methods.

2. Post-Correction effectively reduces hallucinations in generated content.

3. FIFA provides a comprehensive assessment across multiple tasks.

Abstract

Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves…
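The extract, model-dependencies, and verify steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Fact` schema, the `faithfulness_score` function, the yes/no question format, the rule that a fact fails when any fact it depends on fails, and the fraction-of-verified-facts score are all assumptions made for illustration. The VideoQA model is replaced by a stub lookup table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fact:
    """One atomic descriptive fact (hypothetical schema, not from the paper)."""
    fact_id: str
    question: str                      # yes/no question that checks the fact
    depends_on: List[str] = field(default_factory=list)


def faithfulness_score(facts: List[Fact],
                       answer_yes: Callable[[str], bool]) -> float:
    """Return the fraction of facts judged faithful.

    Assumption: the dependency graph is acyclic, and a fact is only
    sent to the QA model when every fact it depends on was verified.
    """
    by_id = {f.fact_id: f for f in facts}
    verdict: Dict[str, bool] = {}

    def check(fact_id: str) -> bool:
        if fact_id in verdict:
            return verdict[fact_id]
        fact = by_id[fact_id]
        parents_ok = all(check(parent) for parent in fact.depends_on)
        # Short-circuit: skip the QA call when a parent fact already failed.
        verdict[fact_id] = parents_ok and answer_yes(fact.question)
        return verdict[fact_id]

    for f in facts:
        check(f.fact_id)
    return sum(verdict.values()) / len(verdict) if verdict else 0.0


# Stub standing in for a VideoQA model (assumption: answers are fixed here).
facts = [
    Fact("dog", "Is there a dog?"),
    Fact("dog_runs", "Is the dog running?", depends_on=["dog"]),
    Fact("park", "Is the scene a park?"),
]
stub_answers = {
    "Is there a dog?": False,
    "Is the dog running?": True,
    "Is the scene a park?": True,
}
score = faithfulness_score(facts, lambda q: stub_answers[q])
print(f"faithfulness = {score:.2f}")
```

In this sketch the "dog is running" fact is counted as unfaithful even though the stub would answer yes, because the entity it depends on was not verified. That is one plausible reason to model dependencies at all: a question about a nonexistent object has no meaningful answer.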

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling