Tell me Habibi, is it Real or Fake?
Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, Abhinav Dhall

TL;DR
This paper introduces ArEnAV, a large-scale Arabic-English audio-visual deepfake dataset with code-switching, dialectal variation, and monolingual content, to improve detection of multilingual deepfakes.
Contribution
It presents the first extensive Arabic-English deepfake dataset with intra-utterance code-switching, generated via a novel pipeline, and benchmarks its effectiveness against existing datasets and models.
Findings
The dataset contains 387k videos and over 765 hours of real and fake video.
Benchmark results show the dataset's potential to enhance deepfake detection.
Human evaluation highlights challenges in multilingual deepfake detection.
Abstract
Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our…
Peer Reviews
Decision: ICLR 2026 Poster
1. Existing deepfake datasets are monolingual or multilingual but lack intra-utterance code-switching. The paper clearly identifies this gap and addresses it convincingly.
2. Large-scale, well-engineered dataset. 387K videos, 4 TTS + 2 lip-sync models, stratified splits, strong statistics, and detailed generation pipeline. The dataset is significantly larger and more diverse than prior multilingual datasets. The authors show that state-of-the-art models (e.g., BA-TFD, LipForensics, Capsule-v2) p
1. The paper does not propose any new detection model or algorithm. I see the work as “engineering + dataset release” rather than a scientific advance.
2. Heavy reliance on closed-source models (GPT-4.1, Whisper, TTS-1, etc.). Reproducibility is partially limited. If OpenAI APIs change, future users may not be able to regenerate the dataset. This may be flagged in the reproducibility checklist.
3. Although CSW is the main motivation, the paper lacks deeper linguistic validation: ⦁ Is the LLM-ge
- It will be released as an open-source, large-scale bilingual dataset. Given that Arabic is spoken by hundreds of millions of people all over the world, the dataset holds significant importance.
- The data generation pipeline was clearly described in the paper, enabling easy reproducibility.
- The quality of the generated fake data was comparable with the well-known dataset AV-Deepfake1M, as evaluated by standard metrics.
1. Insufficient Direct Experimental Evidence for the "Code-Switching" Contribution. The paper's central contribution is its focus on a multilingual and code-switching (CSW) dataset. However, the experimental results presented in Tables 8, 9, and 10a do not directly prove that this code-switching characteristic is the key factor driving the dataset's difficulty. The authors demonstrate that existing models perform poorly and attribute this failure to the novelty of CSW. Specifically, the zero-sho
1. This work successfully tackles an important and overlooked problem: detecting audio-visual deepfakes in code-switched (CSW) speech. This is a major step towards building deepfake detectors that work in the real world.
2. This work proposes ArEnAV, a new large-scale dataset for this task. The pipeline used to create the data is novel and combines several SOTA models, providing a valuable new resource for the research community.
1. The primary evaluation metric (AP@IoU=0.5) may be poorly suited for the dataset's extremely short, single-word forgeries.
2. The "TTS and insert" audio generation method can create unnatural splice artifacts, which may affect the dataset's validity. These artifacts could allow models to detect forgeries using simple audio errors rather than the intended code-switching cues, thus misrepresenting the true nature of the detection challenge.
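The concern about AP@IoU=0.5 on very short forgeries can be illustrated with a small temporal-IoU calculation. This is a minimal sketch and not the paper's evaluation code: the segment durations, the 0.15 s localization offset, and the helper names `temporal_iou` and `shift` are all hypothetical and chosen only for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def shift(segment, offset):
    """Return the segment moved later in time by `offset` seconds."""
    return (segment[0] + offset, segment[1] + offset)


# Hypothetical ground-truth forgeries: one single-word (0.3 s), one long (3 s).
short_gt = (2.00, 2.30)
long_gt = (2.00, 5.00)

# Assume the detector localizes both with the same 0.15 s timing error.
offset = 0.15
short_pred = shift(short_gt, offset)
long_pred = shift(long_gt, offset)

short_iou = temporal_iou(short_pred, short_gt)
long_iou = temporal_iou(long_pred, long_gt)

# The same absolute error fails the 0.5 threshold only for the short segment.
print(f"short segment IoU: {short_iou:.2f}, counted as hit: {short_iou >= 0.5}")
print(f"long segment IoU:  {long_iou:.2f}, counted as hit: {long_iou >= 0.5}")
```

Under these assumed numbers, an identical timing error leaves the long segment well above the 0.5 threshold while the single-word segment falls below it, which is the sensitivity the reviewer appears to be pointing at.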
Taxonomy
Topics: Turkey's Politics and Society · Islamic Studies and History
