Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin; Yang Xiao; Younghoo Kwon; Ting Dang; Jung-Woo Choi

arXiv:2603.04862 · cs.SD · March 10, 2026

TL;DR

This paper introduces Focus-Then-Listen (FTL), a plug-and-play audio enhancer that significantly improves noise robustness of large audio language models without requiring retraining or task-specific noisy data.

Contribution

The paper presents a modular audio enhancer that makes LALMs more robust to noise by separating the input into speech and non-speech, routing on the user's instruction, and applying modality-aware fusion, all without retraining the models.

Findings

01. FTL improves LALMs' performance across various noise conditions.

02. FTL does not require fine-tuning of the underlying models.

03. FTL is effective across multiple LALMs and tasks.

Abstract

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves…
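
The three stages named in the abstract (separate, route, fuse) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the keyword-based `route_modality`, the weighted-sum `fuse`, the `residual_gain` parameter, and the caller-supplied `separate` function are all hypothetical stand-ins for the learned components the paper describes.

```python
from typing import Callable, List, Tuple

Waveform = List[float]
Separator = Callable[[Waveform], Tuple[Waveform, Waveform]]


def route_modality(instruction: str) -> str:
    """Toy router (assumption): pick the target modality from keywords."""
    speech_cues = ("say", "said", "speak", "transcribe", "word")
    text = instruction.lower()
    return "speech" if any(cue in text for cue in speech_cues) else "non_speech"


def fuse(speech: Waveform, non_speech: Waveform, target: str,
         residual_gain: float = 0.1) -> Waveform:
    """Toy fusion (assumption): keep the target stem, attenuate the other."""
    if len(speech) != len(non_speech):
        raise ValueError("stems must have the same length")
    if target == "speech":
        return [s + residual_gain * n for s, n in zip(speech, non_speech)]
    return [n + residual_gain * s for s, n in zip(speech, non_speech)]


def focus_then_listen(mixture: Waveform, instruction: str,
                      separate: Separator,
                      residual_gain: float = 0.1) -> Waveform:
    """Separate the mixture, route on the instruction, then fuse the stems."""
    speech, non_speech = separate(mixture)
    target = route_modality(instruction)
    return fuse(speech, non_speech, target, residual_gain)
```

In this sketch the enhanced waveform would simply be passed to an unmodified LALM in place of the noisy input, which is what makes the enhancer plug-and-play; a real system would replace `separate` with an actual source-separation model.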

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing