SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

Zhaoxi Mu; Xinyu Yang; Gang Wang

arXiv:2505.03273 · cs.SD · May 27, 2025

TL;DR

SepALM is an audio language model (ALM)-based framework that corrects separated speech in the text domain and then re-synthesizes it. The authors report improved robustness and accuracy in challenging real-world acoustic environments, such as noisy and reverberant settings.

Contribution

The paper presents an end-to-end, ALM-based error correction approach for speech separation. It is designed to avoid the error accumulation and optimization difficulties of conventional methods that combine automatic speech recognition (ASR) with large language models (LLMs).

Findings

01. Improves speech separation accuracy in noisy environments
02. Reduces error accumulation compared to conventional methods
03. Enhances adaptability to diverse acoustic settings

Abstract

While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have…
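The four components named in the abstract can be pictured as a simple chain. The following is a minimal sketch only, not the authors' implementation: the class name `SepALMSketch`, the stage order, the data passed between stages, and the aligner's role (matching the re-synthesized speech to the rough separated signal) are all assumptions made for illustration, and the toy stand-in functions exist only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]  # placeholder type for an audio signal (assumption)


@dataclass
class SepALMSketch:
    """Hypothetical wiring of the four components named in the abstract."""
    separator: Callable[[Waveform], List[Waveform]]    # mixture -> rough per-speaker audio
    corrector: Callable[[Waveform], str]               # rough audio -> corrected text (ALM role)
    synthesizer: Callable[[str], Waveform]             # corrected text -> re-synthesized audio
    aligner: Callable[[Waveform, Waveform], Waveform]  # (synthesized, rough) -> aligned audio

    def run(self, mixture: Waveform) -> List[Waveform]:
        outputs = []
        for rough in self.separator(mixture):
            text = self.corrector(rough)         # correction happens in the text domain
            speech = self.synthesizer(text)      # re-synthesis from the corrected text
            outputs.append(self.aligner(speech, rough))
        return outputs


# Toy stand-ins so the sketch executes; none of these model real audio.
def toy_separator(mix: Waveform) -> List[Waveform]:
    mid = len(mix) // 2
    return [mix[:mid], mix[mid:]]


def toy_corrector(rough: Waveform) -> str:
    return "hello"


def toy_synthesizer(text: str) -> Waveform:
    return [0.1] * len(text)


def toy_aligner(speech: Waveform, rough: Waveform) -> Waveform:
    # Pad with zeros, then trim to the length of the rough separated signal.
    return (speech + [0.0] * len(rough))[: len(rough)]


pipeline = SepALMSketch(toy_separator, toy_corrector, toy_synthesizer, toy_aligner)
result = pipeline.run([0.0] * 8)
```

Passing the stages in as plain callables keeps the sketch independent of any particular separation, ALM, or synthesis model, since none is specified in this excerpt.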

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Knowledge Distillation