From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Chun-Yi Kuan, Hung-yi Lee

TL;DR
This paper introduces BALSa, a scalable framework that trains audio-aware large language models (ALLMs) on synthetic data generated by their backbone LLMs, improving audio-language alignment, reducing hallucinations, and enhancing audio understanding.
Contribution
The paper proposes a novel synthetic data generation method for training ALLMs and extends it to multi-audio scenarios. The resulting models show better alignment and fewer hallucinations than prior approaches.
Findings
Mitigates audio hallucinations effectively
Maintains strong performance on audio understanding benchmarks
Enhances multi-audio reasoning capabilities
Abstract
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose,…
