SPO-CLAPScore: Enhancing CLAP-based alignment prediction system with Standardize Preference Optimization, for the first XACLE Challenge
Taisei Takano, Ryoya Yoshida

TL;DR
This paper introduces SPO-CLAPScore, a system that improves audio-text alignment evaluation by standardizing listener scores and screening inconsistent raters, achieving competitive correlation with human judgment in the XACLE Challenge.
Contribution
The paper presents Standardized Preference Optimization (SPO), a novel training method that enhances CLAP-based alignment prediction by reducing scoring biases and improving correlation with human perception.
Findings
Achieved 6th place in the XACLE Challenge with a Spearman's rank correlation coefficient (SRCC) of 0.6142.
SPO and listener screening each improve the correlation between predicted alignment scores and human judgment.
Code is publicly available for reproducibility.
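For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman's rank correlation coefficient can be computed between predicted scores and human ratings. It is not taken from the authors' code; the function names `average_ranks` and `spearman` are illustrative, and tied values are assumed to receive their average rank.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical predicted scores and mean human ratings for four clips.
predicted = [0.1, 0.4, 0.35, 0.8]
human = [2.0, 6.5, 5.0, 9.0]
print(spearman(predicted, human))
```

Because the metric depends only on rank order, a predictor is rewarded for ordering audio-text pairs the way listeners do, regardless of the scale of its raw outputs.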
Abstract
The first XACLE Challenge (x-to-audio alignment challenge) addresses the critical need for automatic evaluation metrics that correlate with human perception of audio-text semantic alignment. In this paper, we describe the "Takano_UTokyo_03" system submitted to the XACLE Challenge. Our approach leverages a CLAPScore-based architecture integrated with a novel training method called Standardized Preference Optimization (SPO). SPO standardizes the raw alignment scores provided by each listener, enabling the model to learn relative preferences and mitigate the impact of individual scoring biases. Additionally, we employ listener screening to exclude listeners with inconsistent ratings. Experimental evaluations demonstrate that both SPO and listener screening effectively improve the correlation with human judgment. Our system achieved 6th place in the challenge with a Spearman's rank correlation…
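The per-listener standardization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes standardization means a z-score computed over each listener's own ratings, and the function name `standardize_per_listener`, the tuple layout, and the `eps` guard are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_listener(ratings, eps=1e-8):
    """Z-score each raw score against its own listener's mean and std.

    ratings: list of (listener_id, item_id, raw_score) tuples.
    Returns a list of (listener_id, item_id, standardized_score) tuples.
    """
    scores_by_listener = defaultdict(list)
    for listener, _, score in ratings:
        scores_by_listener[listener].append(score)

    # Per-listener mean and population standard deviation.
    stats = {
        listener: (mean(scores), pstdev(scores))
        for listener, scores in scores_by_listener.items()
    }

    standardized = []
    for listener, item, score in ratings:
        mu, sigma = stats[listener]
        # eps avoids division by zero for a listener who gives constant scores.
        standardized.append((listener, item, (score - mu) / (sigma + eps)))
    return standardized


# Two hypothetical listeners: "a" rates generously, "b" rates harshly,
# but both prefer clip "y" over clip "x".
ratings = [("a", "x", 8), ("a", "y", 10), ("b", "x", 2), ("b", "y", 4)]
for row in standardize_per_listener(ratings):
    print(row)
```

In this toy example the two listeners' raw scores differ by six points, yet after standardization both assign roughly -1 to clip "x" and +1 to clip "y", which is the sense in which standardized targets express relative preference rather than an individual's scoring habits.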
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
