TL;DR
This paper introduces Mega-MMDF, a large and diverse multimodal deepfake dataset, and DeepfakeBench-MM, a comprehensive benchmark platform for evaluating multimodal deepfake detection methods, addressing current data and standardization gaps.
Contribution
It provides the first large-scale, diverse dataset and a standardized benchmark platform for multimodal deepfake detection, facilitating future research and method evaluation.
Findings
DeepfakeBench-MM supports 5 datasets and 11 detectors.
Comprehensive evaluations reveal insights into augmentation and forgery strategies.
Mega-MMDF contains 1.2 million samples, making it one of the largest datasets.
Abstract
The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinders deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the…
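The real/forged split quoted in the abstract implies a strongly imbalanced label distribution. The sketch below works through that arithmetic using only the two counts stated above. The inverse-frequency weighting (the same "balanced" heuristic offered by scikit-learn) is an illustrative assumption, not something the paper is known to use, and `class_balance` is a hypothetical helper name.

```python
# Sample counts taken from the abstract (rounded figures, not exact counts).
REAL_SAMPLES = 100_000      # "0.1 million real samples"
FORGED_SAMPLES = 1_100_000  # "1.1 million forged samples"


def class_balance(real: int, forged: int) -> dict:
    """Return total size, forged fraction, and inverse-frequency class weights."""
    total = real + forged
    num_classes = 2
    return {
        "total": total,
        "forged_fraction": forged / total,
        # Inverse-frequency weights: total / (num_classes * class_count).
        # Assumption for illustration only; the paper may handle imbalance differently.
        "weight_real": total / (num_classes * real),
        "weight_forged": total / (num_classes * forged),
    }


stats = class_balance(REAL_SAMPLES, FORGED_SAMPLES)
print(stats)
```

Under these rounded counts the total matches the 1.2 million samples listed under Findings, and forged samples outnumber real ones 11 to 1, so a detector trained without reweighting or resampling could reach high accuracy by favoring the forged class.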
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The primary strength is the creation and open-sourcing of Mega-MMDF. A dataset of this scale (1.1M fake samples) and documented diversity (21 pipelines) is a substantial contribution that will undoubtedly fuel future research. 2. The multimodal deepfake field has lacked standardized evaluation until now. The authors' DeepfakeBench-MM provides a much-needed, unified, and extensible platform for fair comparison, which is critical for measuring real progress. The benchmarking of 11 detectors across 5 data
1. The anonymous GitHub repository contains a partly hidden external link to the dataset, and the linked page reveals the authors' information, including their names, university, and the fact that this paper was double-submitted to the NeurIPS Datasets and Benchmarks track (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/J4DVAA). 2. The paper's "key findings" (Sec 4.3) are presented as novel contributions, but they are largely well-known already. For instance, Analysis 4 (Modality Bias) is a widely docume
1. The lack of standardized evaluation is a long-standing issue in the community, and the authors are addressing a real research gap. 2. The authors share the code and also mention "continuous expansion", showcasing their commitment to maintaining a comprehensive multimodal deepfake detection benchmark. 3. The diversity of Mega-MMDF is impressive.
1. The authors need to demonstrate that training on Mega-MMDF leads to better numbers on all test sets. The paper is missing a performance comparison in which models are trained on other recent training sets. I am still not convinced that the quality of the training set is good enough, and I am not able to conclude anything from Table 2. 2. The authors should include a failure case analysis showing the cases in which a model trained on Mega-MMDF fails. 3. The models trained on Mega-MMDF perform well on
1. Mega-MMDF is one of the largest and most diverse multimodal deepfake datasets, substantially surpassing prior datasets in both the number of forgery methods and overall sample size. 2. The dataset construction includes an elaborate, multi-stage quality assessment for audio, video, and synchronization. 3. The evaluations are exhaustive, covering intra-dataset, cross-dataset, and cross-pipeline detection.
1. The paper’s primary technical contributions lie in data and benchmark construction, not in algorithmic advances or new detection paradigms. While infrastructure is critical, there is minimal advancement on detection methodology itself. 2. Although Mega-MMDF boasts scale and diversity, the potential for overfitting to known or compositional artifacts is mentioned only briefly, and the paper lacks rigorous quantitative measures of “wildness” versus real-world deepfake complexity. 3. The benchmark focuses on
The manuscript constructs a large-scale multimodal deepfake dataset and proposes a multimodal deepfake detection benchmark, advancing the development of multimodal detection. It is well-organized and clearly presents the limitations of existing research and the strengths of the proposed approach.
(1) The authors need to provide a more detailed description of the content in Figure 1 to better highlight the advantages of the proposed dataset. (2) The authors should clarify how the thresholds for each metric in Section 3.3 were determined. (3) In Section 3.3, the STT model WhisperX is used to evaluate audio fidelity. What specific metric is employed for this assessment? (4) [1] and [2] also propose ensemble approaches. How does the ensemble model in Section 4.1 differ from theirs? (5) In An
