CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Emilio Villa-Cueva; Sholpan Bolatzhanova; Diana Turmakhan; Kareem Elzeky; Henok Biadglign Ademtew; Alham Fikri Aji; Vladimir Araujo; Israel Abebe Azime; Jinheon Baek; Frederico Belcavello; Fermin Cristobal; Jan Christian Blaise Cruz; Mary Dabre; Raj Dabre; Toqeer Ehsan; Naome A Etori; Fauzan Farooqui; Jiahui Geng; Guido Ivetta; Thanmay Jayakumar; Soyeong Jeong; Zheng Wei Lim; Aishik Mandal; Sofia Martinelli; Mihail Minkov Mihaylov; Daniil Orel; Aniket Pramanick; Sukannya Purkayastha; Israfel Salazar; Haiyue Song; Tiago Timponi Torrent; Debela Desalegn Yadeta; Injy Hamed; Atnafu Lambebo Tonja; Thamar Solorio

arXiv:2505.24456 · cs.CL · September 23, 2025

Open Access

TL;DR

This paper introduces CaMMT, a human-curated benchmark of over 5,800 triples, each pairing an image with parallel captions in English and a regional language. The benchmark is used to test whether visual context improves culturally aware machine translation, especially for region-specific content.

Contribution

The paper presents CaMMT, a new benchmark for multimodal translation in which images supply cultural context, and uses it to evaluate five Vision Language Models (VLMs) in text-only and text+image settings.

Findings

01

Visual context generally improves translation quality, especially for Culturally-Specific Items (CSIs).

02

Images help disambiguate meaning and produce correct gender marking.

03

The evaluated VLMs generally translate cultural content better in the text+image setting than in the text-only setting.

Abstract

Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal…
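The two evaluation settings named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the field names, the prompt wording, and the English-to-regional translation direction are all assumptions made for illustration. The only point it shows is that the text-only and text+image inputs for one triple differ solely in whether the image is attached.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Triple:
    """One benchmark item: an image plus parallel captions.

    All field names here are hypothetical, not the dataset's schema.
    """
    image_path: str
    english_caption: str
    regional_caption: str
    regional_language: str


def build_request(item: Triple, use_image: bool) -> dict:
    """Build a model-agnostic translation request for one setting.

    The prompt text and the English-to-regional direction are assumptions.
    """
    prompt = (
        f"Translate the following caption from English into "
        f"{item.regional_language}.\n"
        f"Caption: {item.english_caption}"
    )
    image: Optional[str] = item.image_path if use_image else None
    return {
        "prompt": prompt,
        # The image is the only thing that changes between the two settings.
        "image": image,
        # The human-written regional caption serves as the reference.
        "reference": item.regional_caption,
    }


def paired_requests(item: Triple) -> Tuple[dict, dict]:
    """Return the (text-only, text+image) requests for the same item."""
    return build_request(item, use_image=False), build_request(item, use_image=True)
```

Keeping the prompt identical across both requests means any difference in output quality can be attributed to the image rather than to the wording of the instruction.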

Code & Models

Datasets

villacu/cammt
dataset · 89 downloads

Videos

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling