GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Sreyan Ghosh; Sonal Kumar; Ashish Seth; Chandra Kiran Reddy Evuru; Utkarsh Tyagi; S Sakshi; Oriol Nieto; Ramani Duraiswami; Dinesh Manocha

arXiv:2406.11768 · cs.SD · June 18, 2024

TL;DR

GAMA is a large audio-language model that integrates advanced audio understanding with complex reasoning, achieved through novel training methods and datasets, outperforming existing models on diverse audio tasks.

Contribution

The paper introduces GAMA, a new large audio-language model with integrated complex reasoning abilities, developed via innovative instruction tuning and multi-layer audio feature aggregation.

Findings

1. GAMA outperforms existing models on diverse audio understanding tasks by margins of 1% to 84%.

2. Instruction tuning with CompA-R enhances GAMA's complex reasoning and instruction-following abilities.

3. GAMA demonstrates superior performance in open-ended audio question-answering.

Abstract

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with…
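The multi-layer aggregation idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the aggregator combines layers, so everything here is an assumption made for illustration: the function names `mean_pool` and `aggregate_layers` are hypothetical, and mean pooling over time followed by a normalized weighted sum merely stands in for whatever GAMA actually uses.

```python
from typing import List

Vector = List[float]
LayerFeatures = List[Vector]  # one feature vector per time frame


def mean_pool(frames: LayerFeatures) -> Vector:
    """Average one layer's frame-level features over time into a single vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def aggregate_layers(layers: List[LayerFeatures], weights: List[float]) -> Vector:
    """Combine pooled features from several encoder layers with a weighted sum.

    Hypothetical illustration only: the weights are normalized to sum to 1,
    which is an assumption and not a detail taken from the paper.
    """
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    total = sum(weights)
    pooled = [mean_pool(layer) for layer in layers]
    dim = len(pooled[0])
    return [
        sum(w / total * vec[i] for w, vec in zip(weights, pooled))
        for i in range(dim)
    ]


# Toy input: three "layers", each with two frames of 2-dimensional features.
layers = [
    [[1.0, 0.0], [3.0, 0.0]],  # pools to [2.0, 0.0]
    [[0.0, 2.0], [0.0, 4.0]],  # pools to [0.0, 3.0]
    [[1.0, 1.0], [1.0, 1.0]],  # pools to [1.0, 1.0]
]
aggregated = aggregate_layers(layers, weights=[1.0, 1.0, 2.0])
print(aggregated)  # [1.0, 1.25]
```

In a real model the per-layer weights would typically be learned and the features would be high-dimensional tensors; the toy numbers only show how information from different encoder depths ends up in one combined representation.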

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies