QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

Chien-Chun Wang; Kuan-Tang Huang; Cheng-Yeh Yang; Hung-Shin Lee; Hsin-Min Wang; Berlin Chen

arXiv:2508.08957 · cs.SD · August 13, 2025

TL;DR

QAMRO is a novel framework that improves the evaluation of audio generation systems by aligning machine assessments more closely with human perception through adaptive ranking optimization.

Contribution

The paper introduces a quality-aware adaptive margin ranking optimization method that combines regression objectives from different perspectives to align audio evaluation more closely with human judgments.

Findings

01 Outperforms baseline models in human evaluation alignment
02 Leverages pre-trained audio-text models like CLAP and Audiobox-Aesthetics
03 Achieves superior results on the AudioMOS Challenge 2025 dataset

Abstract

Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions,…
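The abstract contrasts plain regression on MOS with an objective that also respects the relative ordering of perceptual judgments. As a rough illustration only, the sketch below pairs a mean squared error term with a pairwise hinge loss whose margin grows with the gap between two human scores. It is not the QAMRO loss, whose exact form is not given here; the function names and the `scale` and `rank_weight` parameters are hypothetical.

```python
from itertools import combinations


def adaptive_margin_ranking_loss(pred, target, scale=1.0):
    """Mean pairwise hinge loss with a margin proportional to the MOS gap.

    Illustrative assumption, not the paper's formulation: for each pair
    where one target score is higher, the predicted scores should differ
    in the same direction by at least scale * (target gap).
    Pairs with tied targets are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(pred)), 2):
        gap = target[i] - target[j]
        if gap == 0:
            continue
        if gap < 0:
            # Reorder so that index i always refers to the higher-rated item.
            i, j, gap = j, i, -gap
        margin = scale * gap
        total += max(0.0, margin - (pred[i] - pred[j]))
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def combined_loss(pred, target, rank_weight=0.5, scale=1.0):
    """MSE regression term plus the weighted ranking term (hypothetical mix)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    rank = adaptive_margin_ranking_loss(pred, target, scale)
    return mse + rank_weight * rank
```

With this formulation, a pair of clips whose human scores differ widely is penalized more heavily when the predictor confuses their order than a pair with nearly equal scores, which is one way to emphasize perceptual differences that a pointwise regression loss treats uniformly.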

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Emotion and Mood Recognition