KARMA-MV: A Benchmark for Causal Question Answering on Music Videos
Archishman Ghosh, Abhinaba Roy, Dorien Herremans

TL;DR
KARMA-MV introduces a large-scale multiple-choice QA dataset for causal reasoning about music videos, together with a causal knowledge graph approach that grounds vision-language models in explicit causal structure.
Contribution
The paper presents KARMA-MV, a benchmark dataset for causal reasoning in music videos, and proposes a causal knowledge graph method that improves vision-language model performance on causal questions.
Findings
Causal knowledge graph (CKG) grounding improves reasoning accuracy, especially for smaller models.
Models benefit from explicit causal structure grounding.
KARMA-MV enables scalable causal question answering in music videos.
Abstract
While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for…
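The idea of augmenting a model with structured retrieval of cross-modal dependencies can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `CausalEdge` schema, the token-overlap retrieval in `retrieve_edges`, the prompt layout in `build_prompt`, and the toy edges are hypothetical and are not taken from the paper, whose actual graph construction and retrieval method are not described in this summary.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEdge:
    """Hypothetical edge: a visual event linked to a musical event."""
    cause: str     # visual cue, e.g. "rapid camera cuts"
    effect: str    # musical change, e.g. "tempo increase"
    relation: str  # free-text relation label (illustrative only)


def _tokens(text):
    # Lowercase alphabetic tokens; punctuation is dropped.
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_edges(edges, question, top_k=2):
    """Rank edges by token overlap with the question (a stand-in for
    whatever retrieval the real system uses) and keep the top_k."""
    q = _tokens(question)
    scored = []
    for edge in edges:
        overlap = len(q & _tokens(f"{edge.cause} {edge.effect}"))
        if overlap > 0:
            scored.append((overlap, edge))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [edge for _, edge in scored[:top_k]]


def build_prompt(question, options, edges):
    """Prepend retrieved causal edges to a multiple-choice question."""
    facts = "\n".join(f"- {e.cause} {e.relation} {e.effect}" for e in edges)
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Causal context:\n{facts}\n\nQuestion: {question}\n{choices}\nAnswer:"


# Toy data, invented for this example.
edges = [
    CausalEdge("rapid camera cuts", "tempo increase", "precedes"),
    CausalEdge("slow fade to black", "volume decrease", "precedes"),
    CausalEdge("dancer jump", "drum hit", "coincides with"),
]
question = "What musical change follows the rapid camera cuts?"
options = ["tempo increase", "volume decrease", "key change", "silence"]

retrieved = retrieve_edges(edges, question)
prompt = build_prompt(question, options, retrieved)
```

The resulting `prompt` string would then be passed to a VLM or LLM alongside the video input. Keeping retrieval separate from prompt construction makes it easy to swap the overlap heuristic for an embedding-based or graph-traversal retriever.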
