WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction
Jakaria Islam Emon, Kazi Tamanna Alam, Md. Abu Salek

TL;DR
WhisQ is a multimodal model that predicts overall music quality and text alignment scores for text-to-music systems by combining sequence-level co-attention with optimal transport regularization, improving Spearman correlation with human ratings.
Contribution
The paper introduces WhisQ, a new cross-modal architecture utilizing sequence co-attention and optimal transport regularization for improved MOS prediction in text-to-music tasks.
Findings
7% improvement in Spearman correlation for overall music quality
14% improvement in Spearman correlation for text alignment
Optimal transport regularization yields 10% SRCC gain
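The findings above are stated in terms of the Spearman rank correlation coefficient (SRCC) between predicted and human MOS. As a minimal, dependency-free sketch of how that metric is computed (the function names and the sample scores are illustrative assumptions, not taken from the paper):

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def srcc(predicted, human):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(predicted), average_ranks(human))


# Illustrative (made-up) MOS values for five clips.
human_mos = [3.0, 4.5, 2.0, 4.0, 3.5]
predicted_mos = [2.8, 4.1, 2.5, 4.3, 3.0]
print(round(srcc(predicted_mos, human_mos), 3))
```

Because SRCC depends only on rank order, a predictor can score well without matching the absolute MOS scale, which is why it is a common choice for comparing MOS predictors.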
Abstract
Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality (OMQ) and text alignment (TA), that is, how well the audio matches its text prompt. This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the pretrained Whisper Base model for temporal audio encoding and Qwen 3, a 0.6B Small Language Model (SLM), for text encoding, with both maintaining sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over…
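The abstract does not give the exact form of the Sinkhorn optimal transport loss, so the following is only a rough sketch of the general technique: entropy-regularized optimal transport between a sequence of audio embeddings and a sequence of text embeddings. The cosine-distance cost, the uniform marginals, the value of `epsilon`, and all function names are assumptions made for illustration, not details from the paper.

```python
import math


def cosine_cost(audio, text):
    """Cost matrix C[i][j] = 1 - cosine similarity (assumed cost, not from the paper)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [
        [1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)) for v in text]
        for u in audio
    ]


def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over audio frames (assumption)
    b = [1.0 / m] * m  # uniform mass over text tokens (assumption)
    # Gibbs kernel K = exp(-C / epsilon).
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def sinkhorn_loss(audio, text, epsilon=0.1, n_iters=200):
    """Transport cost <P, C> under the Sinkhorn plan; lower means better aligned."""
    cost = cosine_cost(audio, text)
    plan = sinkhorn_plan(cost, epsilon, n_iters)
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


# Toy 2-D embeddings: three audio frames, two text tokens.
audio_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_seq = [[1.0, 0.0], [0.0, 1.0]]
print(sinkhorn_loss(audio_seq, text_seq))
```

In a trainable model the same iterations would be run on tensors inside an autodiff framework so that the loss can be backpropagated into both encoders; the pure-Python version here only shows the arithmetic.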
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
Methods: Balanced Selection
