KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

Archishman Ghosh; Abhinaba Roy; Dorien Herremans

arXiv:2605.08175 · cs.CV · May 12, 2026

TL;DR

KARMA-MV introduces a large-scale multiple-choice QA dataset and a causal knowledge graph (CKG) approach for causal reasoning about music videos. Its results indicate that grounding models in explicit causal structure improves their answers.

Contribution

The paper presents KARMA-MV, a novel benchmark dataset for causal reasoning in music videos, and proposes a causal knowledge graph method to improve vision-language model performance.

Findings

01. CKG improves reasoning accuracy, especially for smaller models.

02. Models benefit from explicit causal structure grounding.

03. KARMA-MV enables scalable causal question answering in music videos.

Abstract

While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for…
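
To make the idea of "structured retrieval of cross-modal dependencies" more concrete, here is a minimal sketch of how causal edges could be retrieved and placed in front of a multiple-choice question. It is an illustration under stated assumptions, not the authors' method: the `CausalEdge` type, the word-overlap scoring in `retrieve_edges`, the prompt layout in `build_prompt`, and all event labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEdge:
    """One visual-to-musical dependency (hypothetical schema, not from the paper)."""
    cause: str     # visual event label
    effect: str    # musical event label
    relation: str  # e.g. "precedes" or "triggers"


def retrieve_edges(edges, question, top_k=2):
    """Return up to top_k edges that share at least one word with the question.

    Word overlap is a stand-in for whatever retrieval the real system uses.
    """
    q_tokens = set(question.lower().replace("?", "").split())
    scored = []
    for edge in edges:
        edge_tokens = set(f"{edge.cause} {edge.effect}".lower().split())
        overlap = len(q_tokens & edge_tokens)
        if overlap > 0:
            scored.append((overlap, edge))
    # Stable sort: ties keep their original graph order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [edge for _, edge in scored[:top_k]]


def build_prompt(question, options, edges):
    """Prepend retrieved causal facts to a lettered multiple-choice question."""
    facts = "\n".join(f"- {e.cause} {e.relation} {e.effect}" for e in edges)
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"Causal facts:\n{facts}\n\nQuestion: {question}\n{choices}\nAnswer:"


if __name__ == "__main__":
    graph = [
        CausalEdge("camera cut", "drum fill", "triggers"),
        CausalEdge("slow zoom", "string swell", "precedes"),
    ]
    q = "What musical change follows the slow zoom?"
    print(build_prompt(q, ["drum fill", "string swell"], retrieve_edges(graph, q)))
```

The resulting string would be passed to a VLM or LLM alongside the video input. A real system would likely replace the word-overlap score with a learned or embedding-based retriever.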
