MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution
Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei

TL;DR
This paper introduces VLF-FFD, a vision-language fusion approach that pairs an extended, text-annotated dataset (EFF++) with a dedicated fusion network (VLF-Net) to improve face forgery detection. It is reported to outperform existing methods in both cross-dataset and intra-dataset evaluations.
Contribution
The paper presents EFF++, a frame-level extension of the FaceForensics++ (FF++) dataset with textual annotations, and VLF-Net, a bidirectional vision-language fusion network for face forgery detection.
Findings
Achieves state-of-the-art detection accuracy
Effective cross-dataset and intra-dataset performance
Enhanced interpretability through explainability-driven extensions
Abstract
Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific…
Taxonomy
Topics: Digital Media Forensic Detection · Face recognition and analysis
