TL;DR
MINERVA-Cultural is a new multicultural and multilingual video reasoning benchmark with complex native-language questions, aiming to evaluate and improve long video understanding across diverse cultural contexts.
Contribution
The paper introduces a culturally diverse, native-language dataset with multi-step reasoning annotations and proposes a graph-based error analysis method for video reasoning models.
Findings
State-of-the-art models perform significantly below human accuracy.
Errors arise mainly from failures in visual perception of cultural elements.
The benchmark highlights the need for culturally aware video understanding models.
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature Western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce MINERVA-Cultural, a challenging benchmark for multicultural and multilingual video reasoning. MINERVA-Cultural comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, MINERVA-Cultural provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on MINERVA-Cultural requires a deeply situated understanding of visual cultural context. Furthermore, we leverage MINERVA-Cultural's reasoning traces to construct…
