Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao (1); Yuqi Li (1; 2); Yunpeng Luo (1); Jianjun Yin (2); Long Ma (1) ((1) Tencent YouTu Lab; China; (2) Fudan University; China)

arXiv:2602.23393·cs.SD·March 2, 2026

Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao (1), Yuqi Li (1, 2), Yunpeng Luo (1), Jianjun Yin (2), Long Ma (1) ((1) Tencent YouTu Lab, China, (2) Fudan University, China)

PDF

Open Access

TL;DR

This paper introduces AV-LMMDetect, a large multimodal model fine-tuned for audio-visual deepfake detection, which outperforms prior models and demonstrates strong generalization across datasets.

Contribution

The paper presents AV-LMMDetect, a large multimodal model that effectively detects deepfakes by jointly analyzing audio and visual streams, improving upon existing small, task-specific detectors.

Findings

01

Matches or surpasses prior methods on FakeAVCeleb and Mavos-DD datasets.

02

Sets a new state of the art on Mavos-DD datasets.

03

Demonstrates strong cross-domain generalization.

Abstract

Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenerative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing