The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels

Ayuto Tsutsumi; Kohei Tanaka; Sayaka Shiota

arXiv:2602.00604·cs.SD·February 3, 2026

TL;DR

This paper presents a large audio language model (LALM) system for the XACLE challenge. A three-stage training pipeline with CLAP pseudo-labels raises SRCC on the XACLE test set to 0.632, against 0.334 for the baseline, and secures third place in the challenge team ranking.

Contribution

The approach combines automated audio captioning pretraining, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset, and shows that the CLAP pseudo-label stage is the primary performance driver.

Findings

01

Pretraining with CLAP pseudo-labels is the primary driver of the system's performance.

02

The system achieves an SRCC of 0.632 on the XACLE test set.

03

The system outperforms the baseline (SRCC 0.334) and ranks third in the challenge team ranking.
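The findings are reported as SRCC, the Spearman rank correlation coefficient between predicted and reference alignment scores. As a minimal sketch of how such a score can be computed (not the challenge's official scorer; the function names `average_ranks` and `srcc` are hypothetical and chosen for illustration), the metric is the Pearson correlation of the two rank vectors, with ties given the mean of their positions:

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(predicted), average_ranks(reference)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)
```

Because only ranks matter, a system whose scores are on a different scale from the reference ratings can still reach a high SRCC as long as it orders the audio-text pairs the same way.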

Abstract

In this paper, we present our submission to the x-to-audio alignment (XACLE) challenge. The task is to predict the semantic alignment of a given pair of general audio and text. The proposed system is based on a large audio language model (LALM) architecture. We employ a three-stage training pipeline: automated audio captioning pretraining, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset. Our experiments show that pretraining with CLAP pseudo-labels is the primary performance driver. On the XACLE test set, our system reaches an SRCC of 0.632, significantly outperforming the baseline system (0.334) and securing third place in the challenge team ranking. Code and models can be found at https://github.com/shiotalab-tmu/tmu-xacle2026
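The abstract does not spell out how the CLAP pseudo-labels are produced. One common reading, offered here only as an assumption, is that each audio-text pair is scored by the cosine similarity between its CLAP audio embedding and its CLAP text embedding, and that score becomes the regression target for the second training stage. The sketch below illustrates that idea on plain Python lists; the embeddings are toy vectors rather than real CLAP outputs, and `cosine_similarity` and `pseudo_label_pairs` are hypothetical helper names, not functions from the authors' repository.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pseudo_label_pairs(audio_embeddings, text_embeddings):
    """Assumed labeling rule: one similarity score per (audio, text) pair."""
    if len(audio_embeddings) != len(text_embeddings):
        raise ValueError("need one text embedding per audio embedding")
    return [
        cosine_similarity(audio, text)
        for audio, text in zip(audio_embeddings, text_embeddings)
    ]


# Toy stand-ins for CLAP embeddings (assumption: real ones come from a CLAP model).
toy_audio = [[1.0, 0.0], [0.0, 2.0]]
toy_text = [[2.0, 0.0], [3.0, 0.0]]
toy_labels = pseudo_label_pairs(toy_audio, toy_text)
```

Under this assumption, the appeal of the stage is that the labels need no human rating, so a large unlabeled collection of audio-text pairs can be scored before fine-tuning on the human-rated XACLE data.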

Code & Models

Models

Hugging Face: Atotti/xacle-tmu-2026 (model, 9 downloads)


Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing