MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues

TL;DR
MCIF is a comprehensive, human-annotated benchmark designed to evaluate crosslingual and multimodal instruction-following capabilities of large language models across diverse tasks, modalities, and languages, based on scientific talks.
Contribution
It introduces the first benchmark that jointly assesses crosslingual, multimodal instruction following with human annotations across multiple languages and modalities.
Findings
Universal challenges identified across models and tasks.
Significant room for improvement in multimodal, crosslingual instruction following.
Benchmark covers four macro-tasks and three modalities.
Abstract
Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations--hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF…
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper tackles a critical gap in evaluating instruction-following across multiple languages and modalities - an under-explored but essential part of MLLMs. 2. MCIF is carefully constructed with parallel multimodal and multilingual alignment, allowing systematic and fair cross-dimensional comparisons—something absent in prior benchmarks. Using human-curated data from scientific talks ensures higher linguistic and contextual richness than synthetic or crowdsourced datasets. The academic dom
1. Scale issue: despite being well-curated, MCIF covers only ~10 hours of content and 21 talks, which may limit statistical robustness and task diversity. 2. Human baseline: it is better to establish a human baseline, making it clear how far current systems are from “human-level” multimodal understanding. 3. Evaluation metrics: Automatic evaluation on WER, COMET, and BERTScore may not well capture multimodal grounding or factual consistency - especially for tasks like question answering or sum
* The construction of the benchmark relies heavily on human experts, including domain experts in the material used, for ground-truth annotation. It is really great to see such a rigorous collection protocol. * The proposed task metrics (WER, BERTScore, and COMET) are widely used and can be computed without relying on external APIs that may change over time. * The paper presents results on the benchmark across 23 different models from various model families.
* The proposed benchmark is cross-lingual only from English into one of the three alternative languages (German, Italian, Chinese). Input audio and video (e.g. slides) are always English. This may limit the utility of the work for evaluating multi-lingual models’ capabilities in practice. * While the work clearly describes its contribution as a benchmark about scientific talks, the data used seems quite narrow even within that domain; the benchmark exclusively uses recordings from prese
1) The idea is very timely. Multimodal and multilingual instruction following is clearly the next big step for LLMs. 2) The dataset design is thoughtful. Everything is aligned across languages and modalities, which makes comparisons fair and controlled. 3) It is fully human-annotated. The effort to manually create transcripts, translations, and QA pairs really increases reliability. 4) The analysis is broad and systematic. The paper looks at different model types, context lengths, and prompt
1) The dataset size feels small compared to other multimodal benchmarks. About ten hours of content may not capture much variation in topics or speakers. 2) The evaluation relies mostly on automatic metrics. Some human checks or qualitative examples would make the results more convincing. 3) The discussion could go a bit deeper on why models fail.
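For readers unfamiliar with the metrics the reviewers mention, WER (word error rate) is the word-level edit distance between a system transcript and a reference, divided by the number of reference words. The following is a minimal sketch of that textbook definition only. It is not the paper's evaluation code, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization, which real evaluation pipelines usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: words are split on whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.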
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification
