Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

Yiming Chen; Xianghu Yue; Xiaoxue Gao; Chen Zhang; Luis Fernando D'Haro; Robby T. Tan; Haizhou Li

arXiv:2409.18680·cs.SD·November 7, 2024

TL;DR

This paper introduces a multi-audio evaluation benchmark and a novel multi-audio large language model, demonstrating improved multi-audio processing capabilities and data efficiency, advancing towards human-like auditory understanding in machines.

Contribution

It presents the first multi-audio evaluation benchmark and a new multi-audio LLM that effectively captures audio context among multiple streams using synthetic data.

Findings

01. Existing ALLMs struggle with multi-audio scenarios.

02. MALLM outperforms baselines in multi-audio tasks.

03. Synthetic data enables high data efficiency.

Abstract

Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark, which consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that existing ALLMs, while powerful in comprehending primary audio elements in individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed…

Code & Models

Repositories

MatthewCYM/MALLM (official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Masked autoencoder · Focus