Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models
Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, Xiangdong Wang

TL;DR
This paper introduces TimeAudio, a novel method that enhances large audio-language models with precise temporal perception and efficient long audio understanding, enabling better performance on fine-grained temporal tasks.
Contribution
We propose TimeAudio, which incorporates temporal markers, absolute time-aware encoding, and a segment-level token merging module to improve temporal reasoning and long audio comprehension in LALMs.
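To make the segment-level token merging idea concrete, here is a minimal sketch that assumes merging means averaging consecutive frame embeddings within fixed-size segments. The function name `merge_segment_tokens`, the `frames_per_segment` parameter, and the choice of mean pooling are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def merge_segment_tokens(
    frames: List[List[float]], frames_per_segment: int
) -> List[List[float]]:
    """Average consecutive frame embeddings inside each fixed-size segment.

    Hypothetical helper: mean pooling is an assumed merging rule, used here
    only to show how merging shortens the token sequence for long audio.
    """
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")

    merged: List[List[float]] = []
    for start in range(0, len(frames), frames_per_segment):
        # The last segment may be shorter than frames_per_segment.
        segment = frames[start:start + frames_per_segment]
        dim = len(segment[0])
        merged.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return merged


# Three 2-dimensional frames merged in segments of two frames each.
tokens = merge_segment_tokens([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)
```

With this rule, a sequence of N frame tokens shrinks to roughly N / `frames_per_segment` tokens, which is the kind of reduction that makes long audio fit in a language model's context.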
Findings
Strong performance on dense captioning, temporal grounding, and timeline speech summarization.
Effective temporal localization and reasoning.
New dataset and metrics for evaluating fine-grained temporal tasks.
Abstract
Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, which constrains their capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address these limitations, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information.…
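One way to picture an absolute time-aware encoding is a sinusoidal code computed from each frame's time in seconds rather than from its index in the sequence. The sketch below works under that assumption; `absolute_time_encoding`, `add_time_encoding`, `frame_rate_hz`, and `start_offset_s` are hypothetical names, and the sinusoidal form is a stand-in, not the encoding the paper defines.

```python
import math
from typing import List


def absolute_time_encoding(
    t_seconds: float, dim: int, base: float = 10000.0
) -> List[float]:
    """Sinusoidal encoding of an absolute timestamp (assumed form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(t_seconds * freq))
        enc.append(math.cos(t_seconds * freq))
    return enc


def add_time_encoding(
    frames: List[List[float]], frame_rate_hz: float, start_offset_s: float = 0.0
) -> List[List[float]]:
    """Add the encoding of each frame's absolute time to its features."""
    out: List[List[float]] = []
    for idx, frame in enumerate(frames):
        # Absolute time depends on where the chunk starts in the recording,
        # not only on the frame's position inside the chunk.
        t = start_offset_s + idx / frame_rate_hz
        enc = absolute_time_encoding(t, len(frame))
        out.append([x + e for x, e in zip(frame, enc)])
    return out


# The same chunk placed at 0 s and at 30 s receives different encodings.
chunk = [[0.0, 0.0, 0.0, 0.0]]
at_start = add_time_encoding(chunk, frame_rate_hz=50.0)
at_30s = add_time_encoding(chunk, frame_rate_hz=50.0, start_offset_s=30.0)
```

The design point is that identical acoustic frames from different moments of a long recording become distinguishable, which is what a model needs in order to emit timestamps rather than relative positions.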
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
