MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation

Bochao Sun; Yang Xiao; Han Yin

arXiv:2601.06829·cs.SD·January 13, 2026

TL;DR

This paper introduces MoEScore, an objective evaluation model for Text-to-Audio systems that uses a Mixture of Experts architecture with Sequential Cross-Attention, achieving state-of-the-art correlation with human judgments.

Contribution

The paper presents an MoE-based model for TTA evaluation that scores text-audio semantic fidelity objectively, as an alternative to time-consuming human listening tests.

Findings

01

Achieved first place in the XACLE Challenge.

02

Reached an SRCC of 0.6402 on the test dataset, a 30.6% improvement over the challenge baseline.

03

Demonstrated effectiveness of MoE architecture for TTA evaluation.

Abstract

Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranks first in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: https://github.com/S-Orion/MOESCORE.
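
The abstract reports results as SRCC (Spearman's rank correlation coefficient) between predicted relevance scores and human judgments. The following is a minimal sketch of how such a score can be computed, not the challenge's official scoring script. The function names, the toy ratings, and the reading of "30.6%" as a relative gain are all assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline` (assumed meaning of '30.6%')."""
    return (new - baseline) / baseline


# Hypothetical toy data: human relevance ratings vs. model predictions.
human_scores = [1.0, 3.5, 2.0, 4.5, 3.0]
predicted_scores = [0.2, 0.7, 0.4, 0.6, 0.9]

print(round(srcc(human_scores, predicted_scores), 4))  # 0.6
```

Because SRCC depends only on rank order, a predictor does not need to match the scale of the human ratings; it only needs to order the text-audio pairs the way listeners do.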

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies