Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

Junyoung Koh; Jaeyun Lee; Soo Yong Kim; Gyu Hyeong Choi; Jung In Koh; Jordan Phillips; Yeonjin Lee; Min Song

arXiv:2604.09721 · cs.IR · April 14, 2026

TL;DR

This paper introduces Jamendo-MT-QA, a new benchmark dataset for multi-track comparative music question answering, enabling evaluation of models' reasoning across multiple music tracks.

Contribution

The paper constructs a dataset of comparative QA items for multi-track music question answering, spanning three question types, and benchmarks audio-language models on it using both automatic metrics and LLM-based evaluation.

Findings

01

Benchmark includes 36,519 QA items over 12,173 track pairs.

02

Multiple question types: yes/no, short-answer, sentence-level.

03

Evaluation uses both automatic metrics and LLM-based judgment.

Abstract

Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio-language models using both automatic metrics and LLM-as-a-Judge…
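
The relationship between the two dataset sizes in the abstract can be checked directly: three question types per track pair gives 12,173 × 3 = 36,519 items. The sketch below illustrates one way such a record might be laid out. It is a minimal illustration only: the field names, the `ComparativeQA` class, the helper functions, and the sample questions are hypothetical and are not taken from the dataset release.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# The three question types named in the abstract. The string labels
# themselves are an assumption made for this sketch.
QUESTION_TYPES = ("yes_no", "short_answer", "sentence")


@dataclass(frozen=True)
class ComparativeQA:
    """One comparative QA item over a pair of tracks (hypothetical schema)."""
    track_a: str          # identifier of the first track
    track_b: str          # identifier of the second track
    question_type: str    # one of QUESTION_TYPES
    question: str
    answer: str


def expected_item_count(num_pairs: int) -> int:
    """Total QA items if every pair yields one item per question type."""
    return num_pairs * len(QUESTION_TYPES)


def make_pair_items(
    track_a: str,
    track_b: str,
    qa_by_type: Dict[str, Tuple[str, str]],
) -> List[ComparativeQA]:
    """Build the three items for one track pair, rejecting incomplete pairs."""
    missing = set(QUESTION_TYPES) - set(qa_by_type)
    if missing:
        raise ValueError(f"missing question types: {sorted(missing)}")
    return [
        ComparativeQA(track_a, track_b, qtype, *qa_by_type[qtype])
        for qtype in QUESTION_TYPES
    ]


# Made-up placeholder content, used only to exercise the helpers.
example_items = make_pair_items(
    "track_001",
    "track_002",
    {
        "yes_no": ("Is the first track faster than the second?", "yes"),
        "short_answer": ("Which track features more prominent vocals?", "the second"),
        "sentence": (
            "How do the two tracks differ in mood?",
            "The first is energetic, while the second is calm.",
        ),
    },
)

print(len(example_items))           # 3
print(expected_item_count(12_173))  # 36519
```

Keeping the question type as an explicit field makes it straightforward to score each type separately, which matters because yes/no, short-answer, and sentence-level answers typically call for different scoring approaches.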
