A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Bhuiyan Sanjid Shafique; Ashmal Vayani; Muhammad Maaz; Hanoona Abdul Rasheed; Dinura Dissanayake; Mohammed Irfan Kurpath; Yahya Hmaiti; Go Inoue; Jean Lahoud; Md. Safirur Rashid; Shadid Intisar Quasem; Maheen Fatima; Franco Vidal; Mykola Maslych; Ketan Pravin More; Sanoojan Baliah; Hasindri Watawana; Yuhao Li; Fabian Farestam; Leon Schaller; Roman Tymtsiv; Simon Weber; Hisham Cholakkal; Ivan Laptev; Shin'ichi Satoh; Michael Felsberg; Mubarak Shah; Salman Khan; Fahad Shahbaz Khan

arXiv:2506.07032 · cs.CL · October 1, 2025

Open Access · 1 model · 1 dataset · 1 video

TL;DR

This paper introduces ViMUL-Bench, a multilingual video LMM benchmark spanning 14 languages and 15 categories (eight of them culturally diverse), along with a new multilingual video LMM and training dataset to promote cultural and linguistic inclusivity in video understanding.

Contribution

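The paper presents what the authors describe as, to the best of their knowledge, the first multilingual benchmark for video LMMs, together with a multilingual video LMM trained on a large-scale multilingual dataset, with the aim of improving cultural and linguistic inclusivity in video understanding.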

Findings

01

ViMUL-Bench covers 14 languages and 15 categories, eight of which are culturally diverse.

02

The multilingual video LMM outperforms monolingual models in low-resource languages.

03

A large-scale multilingual training set improves model performance across diverse languages.

Abstract

Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are in the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories, ranging from lifestyles and festivals to…

Code & Models

Models

MBZUAI/ViMUL
model · 2 downloads

Datasets

MBZUAI/ViMUL-Bench
dataset · 28 downloads

Videos

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Domain Adaptation and Few-Shot Learning