MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Florinel-Alin Croitoru; Vlad Hondru; Marius Popescu; Radu Tudor Ionescu; Fahad Shahbaz Khan; Mubarak Shah

arXiv:2505.11109·cs.CV·May 19, 2025



TL;DR

This paper introduces MAVOS-DD, a large-scale multilingual open-set benchmark dataset for audio-video deepfake detection, highlighting the challenges of generalizing detectors across unseen languages and deepfake generation models.

Contribution

It provides the first extensive multilingual open-set benchmark for deepfake detection with diverse data and evaluation setups, facilitating future research.

Findings

01

State-of-the-art detectors struggle with open-set scenarios.

02

Performance drops significantly on unseen languages and deepfake models.

03

The dataset enables robust evaluation of deepfake detection methods.

Abstract

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at:…
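The open-set organization described in the abstract can be illustrated with a minimal sketch. It assumes each test sample carries a language tag and a generator tag, and that a sample is routed by whether those tags were seen during training. The language and generator names, the `Sample` type, the `assign_test_split` helper, the split labels, and the handling of real videos are all hypothetical placeholders; the benchmark itself may define its splits differently (for example, as overlapping sets).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholders: the actual training languages and
# generative models are defined by the benchmark, not by this sketch.
TRAIN_LANGUAGES = {"lang_a", "lang_b"}
TRAIN_GENERATORS = {"gen_1", "gen_2"}


@dataclass(frozen=True)
class Sample:
    path: str
    language: str
    generator: Optional[str]  # None marks a real (non-generated) video


def assign_test_split(sample: Sample) -> str:
    """Route a test sample by whether its language/generator were seen in training."""
    seen_language = sample.language in TRAIN_LANGUAGES
    # Assumption: real videos have no generator, so only their language matters.
    seen_generator = sample.generator is None or sample.generator in TRAIN_GENERATORS

    if seen_language and seen_generator:
        return "in-domain"
    if seen_language:
        return "open-set model"      # known language, unseen generator
    if seen_generator:
        return "open-set language"   # unseen language, known generator
    return "open-set full"           # both language and generator unseen


if __name__ == "__main__":
    examples = [
        Sample("a.mp4", "lang_a", "gen_1"),
        Sample("b.mp4", "lang_a", "gen_9"),
        Sample("c.mp4", "lang_z", None),
        Sample("d.mp4", "lang_z", "gen_9"),
    ]
    for s in examples:
        print(s.path, "->", assign_test_split(s))
```

Comparing a detector's scores on the in-domain bucket with its scores on the three open-set buckets is one way to attribute a performance drop to unseen generators, unseen languages, or both.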

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. MAVOS-DD is a benchmark that offers explicitly defined training, validation, and four testing splits that jointly vary across languages and generative models. 2. The dataset spans eight languages with relatively balanced distributions and integrates both visual and auditory manipulations, making it broader than previous resources such as FakeAVCeleb or PolyGlotFake. 3. The authors conduct detailed analyses including model ablations, language-wise results, and audio-video synchronization tests,

Weaknesses

1. The "multilingual" aspect, as a key novelty, is an incremental contribution given the emergence of other benchmarks. 2. The evaluation is entirely self-contained within MAVOS-DD, and all results are based on models fine-tuned and tested on the proposed dataset. The paper does not examine how the fine-tuned models perform on existing benchmarks such as Deepfake-Eval-2024 or PolyGlotFake. Such cross-dataset validation would provide a broader form of open-set evaluation, verifying whether the p

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper demonstrates that current audio-visual deepfake detection methods still exhibit insufficient generalization in open-world scenarios, which is a particularly important and often overlooked issue.

Weaknesses

1. The definition of “open-set” essentially refers to domain mismatch, that is, generalization. The key point is how to generalize to unseen attacks, unseen languages, etc. Therefore, the authors should emphasize this aspect more clearly. Using the term “open-set” alone may mislead readers into thinking it refers to environmental variations in real-world testing. 2. The paper only verifies its claims through performance degradation in experiments, but lacks deeper analysis. For example, can the au

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- First truly multilingual open-set benchmark for multimodal deepfake detection.
- Large-scale, balanced dataset covering diverse generation methods.
- Clear experimental protocol and significant empirical findings on generalization limits.
- Public release of data and code promotes reproducibility.

Weaknesses

A notable limitation of the paper is the underexplored multilingual aspect. Although MAVOS-DD includes eight languages, the experiments do not analyze performance across different linguistic groups or phonetic structures. There is no discussion of how language-related features (e.g., tonal versus non-tonal prosody, articulation speed, or lip movement diversity) might affect audio–visual coherence and thus detection difficulty. As a result, the “multilingual” claim feels primarily structural rath

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The dataset is designed using open-set splits, which mimics real-world deployment scenarios. 2. The created dataset is comprehensive, encompassing more than 300 hours in length and 8 languages. 3. Utilizes recent deepfake mechanisms in a targeted manner to make comprehensive deepfake examples.

Weaknesses

1. The benchmark is conducted on three models (two multimodal and one video-only). This is a fairly limited evaluation, and there should have been more benchmarking experiments. Some relevant papers are [1][2][3]. Furthermore, the scores on the in-domain and open-set full sets reach 0.96 mAP and 0.9 mAP, respectively. So how challenging is this dataset compared to existing datasets, and what is the exact nature of these challenges? 2. The novel contribution of the dataset is limited to the

Reviewer 05 · Rating 6 · Confidence 3

Strengths

This paper precisely identifies a core challenge in deepfake detection—model generalization. Most existing studies evaluate under closed-set assumptions, leading to inflated performance metrics that fail to reflect how models would perform in real-world, unpredictable environments. By introducing two key variables—"unseen generation models" and "unseen languages"—MAVOS-DD provides the research community with a highly valuable evaluation benchmark that more closely mirrors real-world challenges.

Weaknesses

The authors mention fine-tuning the EchoMimic model to adapt it to new languages, but this is only briefly stated as “using 1,000 videos and training for 10 epochs.” Crucially missing are details about the network architecture, loss function, optimizer, learning rate, and other key hyperparameters used during fine-tuning. This lack of information prevents others from reproducing samples with the same artifact characteristics. Due to quality concerns, the authors imposed strict constraints on the

Code & Models

Datasets

unibuc-cs/MAVOS-DD
dataset · 12k dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Image and Video Quality Assessment