MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation
Bochao Sun, Yang Xiao, Han Yin

TL;DR
This paper introduces MoEScore, an objective evaluation model for Text-to-Audio systems that uses a Mixture of Experts architecture with Sequential Cross-Attention, achieving state-of-the-art correlation with human judgments.
Contribution
The paper presents the first MoE-based model for TTA evaluation, significantly improving semantic fidelity assessment over traditional methods.
Findings
Achieved first place in the XACLE Challenge.
Reached an SRCC of 0.6402 on the test dataset, a 30.6% improvement over the challenge baseline.
Demonstrated effectiveness of MoE architecture for TTA evaluation.
Abstract
Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranks first in the XACLE Challenge, achieving an SRCC of 0.6402 on the test dataset (an improvement of 30.6% over the challenge baseline). Code is available at: https://github.com/S-Orion/MOESCORE.
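The headline metric, SRCC (Spearman rank correlation coefficient), measures how well the ordering of predicted relevance scores matches the ordering of human ratings. The following is a minimal sketch of that computation, not the authors' evaluation code: the function names and the toy scores are hypothetical, and `relative_improvement` assumes the reported 30.6% is a relative gain over the baseline SRCC, which the abstract does not state explicitly.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, human):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rp, rh = average_ranks(predicted), average_ranks(human)
    n = len(rp)
    mean_p, mean_h = sum(rp) / n, sum(rh) / n
    cov = sum((a - mean_p) * (b - mean_h) for a, b in zip(rp, rh))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_h = sum((b - mean_h) ** 2 for b in rh)
    return cov / sqrt(var_p * var_h)


def relative_improvement(score, baseline):
    """Percentage gain of score over baseline (assumed to be relative)."""
    return (score - baseline) / baseline * 100


# Toy data only: hypothetical model scores and human ratings for five clips.
model_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
human_ratings = [1, 2, 3, 4, 5]
print(srcc(model_scores, human_ratings))
```

Because SRCC depends only on rank order, a predictor does not need to match the scale of the human ratings, only their ordering.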
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
