WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction

Jakaria Islam Emon; Kazi Tamanna Alam; Md. Abu Salek

arXiv:2506.05899 · cs.SD · June 9, 2025

Open Access

TL;DR

WhisQ is a multimodal model that predicts mean opinion scores for both overall musical quality and text alignment in text-to-music systems. It combines sequence-level co-attention with optimal transport regularization, with reported gains of 7% (overall quality) and 14% (text alignment).

Contribution

The paper introduces WhisQ, a new cross-modal architecture utilizing sequence co-attention and optimal transport regularization for improved MOS prediction in text-to-music tasks.
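As a rough illustration of what bidirectional sequence co-attention means, the sketch below lets each audio frame attend over the text tokens and each text token attend over the audio frames, using plain scaled dot-product attention. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `attend` and `co_attention` are hypothetical, there are no learned projections or multiple heads, and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys):
    """Scaled dot-product attention where the keys also serve as values.

    Each query vector is replaced by a softmax-weighted average of the keys.
    Assumption: no learned projection matrices (a real model would have them).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[t] for w, k in zip(weights, keys)) for t in range(d)])
    return out

def co_attention(audio_seq, text_seq):
    """Bidirectional co-attention: audio -> text and text -> audio."""
    audio_ctx = attend(audio_seq, text_seq)  # one text-aware vector per audio frame
    text_ctx = attend(text_seq, audio_seq)   # one audio-aware vector per text token
    return audio_ctx, text_ctx

# Hypothetical toy sequences: 3 audio frames and 2 text tokens, dimension 2.
audio_ctx, text_ctx = co_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, -1.0]],
)
```

Both outputs keep the length of their query sequence, which is why the sequence structure has to be preserved by the encoders rather than pooled away before this step.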

Findings

01. 7% improvement in Spearman correlation for overall music quality
02. 14% improvement in text alignment accuracy
03. Optimal transport regularization yields 10% SRCC gain
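The findings above are stated in terms of Spearman rank correlation (SRCC), which compares the ordering of predicted scores with the ordering of human ratings. The sketch below shows one way to compute it with the standard library only: rank both lists (ties get the average rank) and take the Pearson correlation of the ranks. The helper names and the sample scores are hypothetical, and the code assumes neither list is constant.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def srcc(predicted, target):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(target))

# Hypothetical MOS values for five clips (not taken from the paper).
human_mos = [2.1, 3.4, 4.0, 1.5, 3.9]
predicted_mos = [2.5, 3.0, 4.2, 1.9, 3.1]
score = srcc(predicted_mos, human_mos)
```

Because only the ranks matter, a predictor can be off in absolute MOS values and still score well, as long as it orders the clips the way listeners do.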

Abstract

Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality (OMQ) and text-prompt alignment (TA). This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the pretrained Whisper Base model for temporal audio encoding and Qwen 3, a 0.6B-parameter small language model (SLM), for text encoding; both encoders preserve sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over…
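To make the Sinkhorn optimal transport loss mentioned in the abstract more concrete, here is a minimal sketch of entropy-regularized optimal transport between a set of audio embeddings and a set of text embeddings. Several choices are assumptions rather than details from the paper: the cosine-distance cost, uniform marginals, the regularization strength `eps`, the iteration count, and the function names. Inputs are assumed to be nonzero vectors.

```python
import math

def cosine_cost(a, b):
    """Cost = 1 - cosine similarity (assumed cost; requires nonzero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def sinkhorn_cost(audio, text, eps=0.1, n_iters=200):
    """Return (transport cost, transport plan) via Sinkhorn iterations.

    Uniform marginals are assumed: each audio frame and each text token
    carries equal mass. eps is the entropic regularization strength.
    """
    n, m = len(audio), len(text)
    C = [[cosine_cost(a, t) for t in text] for a in audio]
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    r = [1.0 / n] * n
    c = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(plan[i][j] * C[i][j] for i in range(n) for j in range(m))
    return cost, plan

# Hypothetical toy embeddings: well-aligned sets should cost less to transport.
aligned_cost, _ = sinkhorn_cost([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned_cost, _ = sinkhorn_cost([[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]])
```

Used as a regularizer, a cost like this would be added to the MOS regression loss so that training pulls matching audio and text embeddings closer together. In a real training loop it would be written with a differentiable tensor library rather than plain lists.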

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis

Methods: Balanced Selection