Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

TL;DR
This paper introduces Focus-Then-Listen (FTL), a plug-and-play audio enhancer that significantly improves noise robustness of large audio language models without requiring retraining or task-specific noisy data.
Contribution
The paper presents a modular audio enhancer that makes LALMs more robust to noise by separating the input into speech and non-speech, routing on the user's instruction to select the target modality, and applying modality-aware fusion, all without retraining the models.
Findings
FTL improves LALMs' performance across various noise conditions.
FTL does not require fine-tuning of the underlying models.
FTL is effective across multiple LALMs and tasks.
Abstract
Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves…
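The three stages named in the abstract (separate, route, fuse) can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the mask-based `separate`, the keyword-based `route`, the `SPEECH_HINTS` list, and the `residual_gain` parameter are all hypothetical stand-ins, since the abstract does not specify how each block is built.

```python
from dataclasses import dataclass
from typing import List, Sequence

Waveform = List[float]


@dataclass
class Separated:
    speech: Waveform
    non_speech: Waveform


def separate(mixture: Sequence[float], speech_mask: Sequence[float]) -> Separated:
    """Stand-in separator (assumption): split the mixture with a per-sample
    mask in [0, 1]. A real system would use a learned separation model."""
    speech = [m * x for m, x in zip(speech_mask, mixture)]
    non_speech = [(1.0 - m) * x for m, x in zip(speech_mask, mixture)]
    return Separated(speech, non_speech)


# Hypothetical keyword list; a real router would presumably be a learned model.
SPEECH_HINTS = ("say", "said", "speaker", "transcribe", "spoken", "word")


def route(instruction: str) -> str:
    """Toy modality router: returns 'speech' or 'non_speech' from the instruction."""
    text = instruction.lower()
    return "speech" if any(hint in text for hint in SPEECH_HINTS) else "non_speech"


def fuse(parts: Separated, target: str, residual_gain: float = 0.2) -> Waveform:
    """Toy modality-aware fusion (assumption): keep the target stream at full
    level and mix the other stream back in, attenuated by residual_gain."""
    if target == "speech":
        keep, damp = parts.speech, parts.non_speech
    else:
        keep, damp = parts.non_speech, parts.speech
    return [k + residual_gain * d for k, d in zip(keep, damp)]


def focus_then_listen(mixture: Sequence[float],
                      speech_mask: Sequence[float],
                      instruction: str) -> Waveform:
    """Separate, route on the instruction, then fuse into one enhanced signal."""
    parts = separate(mixture, speech_mask)
    return fuse(parts, route(instruction))
```

The enhanced waveform returned by `focus_then_listen` would then be passed to an unmodified LALM in place of the raw noisy input, which is what makes the approach plug-and-play. Mixing a little of the non-target stream back in, rather than discarding it, is a design choice made here for illustration only.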
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
