Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models

Hualei Wang; Yiming Li; Shuo Ma; Hong Liu; Xiangdong Wang

arXiv:2511.11039 · cs.SD · December 15, 2025


TL;DR

This paper introduces TimeAudio, a novel method that enhances large audio-language models with precise temporal perception and efficient long audio understanding, enabling better performance on fine-grained temporal tasks.

Contribution

We propose TimeAudio, which incorporates temporal markers, absolute time-aware encoding, and a segment-level token merging module to improve temporal reasoning and long audio comprehension in LALMs.
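To make two of these components concrete, here is a minimal sketch of how temporal markers and segment-level token merging might fit together. It is an illustration under stated assumptions, not the authors' implementation: the marker format `<t=…s>`, the use of mean pooling as the merge operation, and the names `merge_segment` and `build_sequence` are all hypothetical.

```python
from typing import List, Sequence, Union

Vector = List[float]


def merge_segment(frames: Sequence[Sequence[float]]) -> Vector:
    """Mean-pool one segment of frame embeddings into a single vector.

    Assumption: mean pooling stands in for the paper's merging module,
    whose actual design is not described on this page.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def build_sequence(
    frames: Sequence[Sequence[float]],
    frame_rate_hz: float,
    segment_seconds: float,
) -> List[Union[str, Vector]]:
    """Interleave a time marker with one merged token per segment.

    The marker string "<t=...s>" is a hypothetical format chosen for
    illustration only.
    """
    per_segment = max(1, int(round(frame_rate_hz * segment_seconds)))
    sequence: List[Union[str, Vector]] = []
    for start in range(0, len(frames), per_segment):
        segment = frames[start:start + per_segment]
        start_time = start / frame_rate_hz  # absolute start of this segment
        sequence.append(f"<t={start_time:.1f}s>")
        sequence.append(merge_segment(segment))
    return sequence


# Five 2-dimensional frames at 2 frames per second, merged per 1-second segment.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
sequence = build_sequence(frames, frame_rate_hz=2.0, segment_seconds=1.0)
```

In this sketch the sequence length grows with the number of segments rather than the number of frames, which is the general reason a merging step helps with long audio.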

Findings

01. Strong performance on dense captioning, temporal grounding, and timeline speech summarization.

02. Effective temporal localization and reasoning capabilities demonstrated.

03. New dataset and metrics for evaluating fine-grained temporal tasks.

Abstract

Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information.…
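One way to picture an encoding that grounds acoustic features in absolute time is a sinusoidal encoding indexed by seconds rather than by token position. The sketch below is an assumption for illustration only: the abstract does not specify the encoding's form, and the function names, the sinusoidal formulation, and the `base` constant are hypothetical.

```python
import math
from typing import List, Sequence


def absolute_time_encoding(t_seconds: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of an absolute timestamp (in seconds).

    Assumption: a standard sin/cos scheme, driven by time instead of
    token index. The paper's actual encoding may differ.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        encoding.append(math.sin(t_seconds * freq))
        encoding.append(math.cos(t_seconds * freq))
    return encoding


def add_time_encoding(
    features: Sequence[Sequence[float]],
    frame_rate_hz: float,
    offset_seconds: float = 0.0,
) -> List[List[float]]:
    """Add the encoding of each frame's absolute time to its feature vector."""
    out: List[List[float]] = []
    for idx, feat in enumerate(features):
        t = offset_seconds + idx / frame_rate_hz  # absolute time of this frame
        enc = absolute_time_encoding(t, len(feat))
        out.append([x + e for x, e in zip(feat, enc)])
    return out


# A frame at 1.0 s gets the same encoding whether it is the third frame of a
# clip starting at 0 s or the first frame of a chunk that starts at 1.0 s.
whole = add_time_encoding([[0.0] * 4] * 3, frame_rate_hz=2.0)
chunk = add_time_encoding([[0.0] * 4], frame_rate_hz=2.0, offset_seconds=1.0)
```

Because the encoding depends on the timestamp and not on where a frame falls in the input, a long recording can be processed in chunks without losing track of when each frame occurred.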


Code & Models

Models

lysanderism/TimeAudio (Hugging Face model)

Datasets

lysanderism/FTAR (dataset)

Videos

Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models· underline

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis