Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models

Arvind Krishna Sridhar; Yinyi Guo; Erik Visser

arXiv:2409.06223 · cs.SD · December 16, 2024

TL;DR

This paper improves temporal reasoning in Large Audio Language Models for audio question answering by introducing data augmentation, curriculum fine-tuning, and on-device deployment, enhancing their practical applicability.

Contribution

It proposes a novel data augmentation method and curriculum fine-tuning strategy to enhance temporal reasoning in LALMs for audio QA tasks.

Findings

01. Enhanced temporal reasoning performance on benchmark datasets
02. Successful on-device CPU inference demonstration
03. Improved model specialization without performance trade-offs

Abstract

The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we further fine-tune an existing baseline using a curriculum learning strategy to specialize in temporal reasoning without compromising…
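To make the two steps named in the abstract more concrete, here is a minimal sketch under stated assumptions. It is not the authors' pipeline: the paper generates questions with an LLM, whereas this sketch swaps in a fixed template, and the staged data mix is only one plausible reading of "curriculum learning". The names `AudioEvent`, `temporal_qa_pairs`, and `curriculum_mix` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    """A labeled sound event with onset/offset in seconds (hypothetical schema)."""
    label: str
    onset: float
    offset: float


def temporal_qa_pairs(events):
    """Build event-order questions from timestamped events.

    Assumption: a fixed template stands in for the LLM-based
    generation described in the abstract.
    """
    ordered = sorted(events, key=lambda e: e.onset)
    pairs = []
    for earlier, later in zip(ordered, ordered[1:]):
        if earlier.onset == later.onset:
            continue  # order is ambiguous, so skip this pair
        question = f"Which sound starts first: {later.label} or {earlier.label}?"
        pairs.append((question, earlier.label))
    return pairs


def curriculum_mix(general, temporal, stage, num_stages):
    """Return the training examples for one curriculum stage.

    Assumption: general examples are kept at every stage and the share
    of temporal examples grows linearly until all are included.
    """
    if num_stages < 1 or not 0 <= stage < num_stages:
        raise ValueError("stage must satisfy 0 <= stage < num_stages")
    fraction = (stage + 1) / num_stages
    n_temporal = round(fraction * len(temporal))
    return list(general) + list(temporal[:n_temporal])


events = [
    AudioEvent("dog bark", 2.0, 3.0),
    AudioEvent("car horn", 0.5, 1.0),
    AudioEvent("door slam", 4.0, 4.2),
]
pairs = temporal_qa_pairs(events)
stage_0 = curriculum_mix(["g1", "g2"], ["t1", "t2", "t3", "t4"], 0, 2)
stage_1 = curriculum_mix(["g1", "g2"], ["t1", "t2", "t3", "t4"], 1, 2)
```

Keeping general examples in every stage is one simple way to guard against losing general audio understanding while specializing; the schedule actually used in the paper may differ.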

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need