Leveraging large multimodal models for audio-video deepfake detection: a pilot study
Songjun Cao (1), Yuqi Li (1, 2), Yunpeng Luo (1), Jianjun Yin (2), Long Ma (1) ((1) Tencent YouTu Lab, China, (2) Fudan University, China)

TL;DR
This paper introduces AV-LMMDetect, a large multimodal model fine-tuned for audio-visual deepfake detection, which outperforms prior models and demonstrates strong generalization across datasets.
Contribution
The paper presents AV-LMMDetect, a large multimodal model that effectively detects deepfakes by jointly analyzing audio and visual streams, improving upon existing small, task-specific detectors.
Findings
Matches or surpasses prior methods on FakeAVCeleb and Mavos-DD datasets.
Sets a new state of the art on Mavos-DD datasets.
Demonstrates strong cross-domain generalization.
Abstract
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing
