From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Chun-Yi Kuan; Hung-yi Lee

arXiv:2505.20166·eess.AS·January 13, 2026

TL;DR

This paper introduces BALSa, a scalable framework that uses synthetic data generated by backbone LLMs to improve audio-language alignment in audio-aware large language models (ALLMs), reducing hallucinations and improving audio understanding.

Contribution

The paper proposes a synthetic data generation method for training ALLMs and extends it to multi-audio scenarios. Compared to prior approaches, the method improves alignment and reduces hallucinations.

Findings

01 Mitigates audio hallucinations effectively

02 Maintains strong performance on audio understanding benchmarks

03 Enhances multi-audio reasoning capabilities

Abstract

Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose,…
