A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More

TL;DR
This paper introduces ViMUL-Bench, a multilingual video LMM benchmark across 14 languages and 15 cultural categories, along with a new multilingual video LMM and training dataset to promote cultural and linguistic inclusivity in video understanding.
Contribution
It presents the first comprehensive multilingual video LMM benchmark and a new multilingual video LMM trained on a large-scale multilingual dataset, improving inclusivity in video understanding.
Findings
ViMUL-Bench covers 14 languages and 15 cultural categories.
The multilingual video LMM outperforms monolingual models in low-resource languages.
A large-scale multilingual training set improves model performance across diverse languages.
Abstract
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs focus on the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond English for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories, ranging from lifestyles and festivals to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Domain Adaptation and Few-Shot Learning
