MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

Siran Peng; Zipei Wang; Li Gao; Xiangyu Zhu; Tianshuo Zhang; Ajian Liu; Haoyuan Zhang; Zhen Lei

arXiv:2505.02013 · cs.CV · May 6, 2025

TL;DR

This paper introduces VLF-FFD, a vision-language fusion approach to face forgery detection. It combines EFF++, an extended dataset with textual annotations, with VLF-Net, a bidirectional fusion network, and is reported to outperform existing methods in both cross-dataset and intra-dataset evaluations.

Contribution

The paper presents EFF++, an extended dataset with textual annotations, and VLF-Net, a bidirectional vision-language fusion network, advancing face forgery detection capabilities.
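To make "bidirectional vision-language fusion" concrete, here is a minimal sketch of one common way to fuse two modalities in both directions: visual tokens attend to text tokens, and text tokens attend to visual tokens. This is not the VLF-Net architecture, which this page does not describe. The function names (`cross_attend`, `bidirectional_fuse`), the toy token values, the residual addition, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention without learned projections (assumption).

    Each query vector is replaced by a softmax-weighted average of the
    context vectors, so information flows from `context` into `queries`.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def bidirectional_fuse(visual_tokens, text_tokens):
    """Hypothetical fusion step: attend in both directions, then add residuals."""
    text_to_visual = cross_attend(visual_tokens, text_tokens)  # text informs vision
    visual_to_text = cross_attend(text_tokens, visual_tokens)  # vision informs text
    fused_visual = [
        [v + c for v, c in zip(tok, ctx)]
        for tok, ctx in zip(visual_tokens, text_to_visual)
    ]
    fused_text = [
        [t + c for t, c in zip(tok, ctx)]
        for tok, ctx in zip(text_tokens, visual_to_text)
    ]
    return fused_visual, fused_text


# Toy inputs: three visual tokens and two text tokens, each of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[1.0, 1.0], [0.0, 2.0]]
fused_visual, fused_text = bidirectional_fuse(visual, text)
```

Each modality keeps its own token count after fusion, so `fused_visual` still has three tokens and `fused_text` two. A real model would add learned query, key, and value projections and would typically stack several such layers.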

Findings

1. Achieves state-of-the-art detection accuracy
2. Effective cross-dataset and intra-dataset performance
3. Enhanced interpretability through explainability-driven extensions

Abstract

Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific…

Taxonomy

Topics: Digital Media Forensic Detection · Face recognition and analysis