Self-Improvement for Audio Large Language Model using Unlabeled Speech

Shaowen Wang; Xinyuan Chen; Yao Xu

arXiv:2507.20169 · cs.SD · July 29, 2025

TL;DR

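This paper introduces SI-SDA, a self-improvement method that adapts audio large language models to target domains without labeled data. It uses information from large-model decoding to evaluate the quality of the model's own pseudo labels, then performs domain adaptation through reinforcement learning optimization.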

Contribution

The paper presents a self-improvement approach that uses the information embedded in large-model decoding to assess pseudo-label quality, enabling domain adaptation without labeled data and outperforming existing baselines.

Findings

01. Consistent and significant improvements in WER and BLEU over existing baselines across multiple public ASR, SQA, and S2TT datasets.

02. High data efficiency demonstrated in experiments.

03. Effective domain adaptation without labeled data.
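
For readers unfamiliar with the WER metric named in the findings, here is a minimal sketch of the standard definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the paper's own evaluation scripts and text normalization are not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Lower WER is better, so an "improvement in WER" means a reduction. BLEU, by contrast, is a higher-is-better overlap score.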

Abstract

Recent audio LLMs have emerged rapidly, demonstrating strong generalization across various speech tasks. However, given the inherent complexity of speech signals, these models inevitably suffer from performance degradation in specific target domains. To address this, we focus on enhancing audio LLMs in target domains without any labeled data. We propose a self-improvement method called SI-SDA, leveraging the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then perform domain adaptation based on reinforcement learning optimization. Experimental results show that our method consistently and significantly improves audio LLM performance, outperforming existing baselines in WER and BLEU across multiple public datasets of automatic speech recognition (ASR), spoken question-answering (SQA), and speech-to-text translation (S2TT). Furthermore,…
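
The abstract does not say which decoding information SI-SDA uses or which reinforcement learning objective it optimizes, so the following is only an illustrative sketch of the general idea of scoring pseudo labels without ground truth. It assumes, purely for illustration, that the quality signal is the mean token log-probability of each decoded candidate. The `Candidate` type and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    """One decoded pseudo label for an unlabeled utterance (hypothetical type)."""
    text: str
    token_logprobs: list[float]  # assumed: per-token log-probabilities from decoding


def confidence_score(candidate: Candidate) -> float:
    """Length-normalized log-probability, used here as a stand-in quality score."""
    if not candidate.token_logprobs:
        return float("-inf")  # an empty decode carries no evidence of quality
    return mean(candidate.token_logprobs)


def rank_pseudo_labels(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates from most to least confident under the assumed score."""
    return sorted(candidates, key=confidence_score, reverse=True)


if __name__ == "__main__":
    decoded = [
        Candidate("hollow word", [-1.0, -2.0]),
        Candidate("hello world", [-0.1, -0.2]),
    ]
    for cand in rank_pseudo_labels(decoded):
        print(f"{confidence_score(cand):.3f}  {cand.text}")
```

In a pipeline of this general shape, the ranked candidates could then supply a reward or preference signal to the adaptation step. How SI-SDA actually does this is specified in the full paper, not here.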
