Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

Mingchen Shao; Hang Su; Wenjie Tian; Bingshen Mu; Zhennan Lin; Lichun Fan; Zhenbo Luo; Jian Luan; Lei Xie

arXiv:2604.22245 · eess.AS · April 27, 2026



TL;DR

This paper introduces LAT-Audio, a framework that improves temporal awareness in long-form audio understanding through progressive global-to-local reasoning and iterative local context integration, together with the LAT-Chronicle dataset and the LAT-Bench benchmark.

Contribution

The paper presents LAT-Audio, a modeling approach accompanied by a new dataset (LAT-Chronicle) and a new benchmark (LAT-Bench) tailored to long-form temporal audio tasks. Together they target the loss of temporal alignment accuracy that existing models show as audio duration grows.

Findings

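01 LAT-Audio outperforms existing models on long-form temporal tasks.

02 LAT-Chronicle, the new dataset, contains 1.2k hours of long-form audio with temporal annotations.

03 LAT-Bench, the new benchmark, supports audio up to 30 minutes and covers three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption.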

Abstract

While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k-hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is…
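To make "temporal alignment accuracy" concrete, the following is a minimal sketch of temporal intersection-over-union (IoU), a common way to compare a predicted time span against a reference span in grounding-style tasks. The abstract as shown here does not state which metric LAT-Bench uses, so this is an illustrative assumption rather than the paper's evaluation protocol; `Segment` and `temporal_iou` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time span in seconds (hypothetical helper type, not from the paper)."""
    start: float
    end: float


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Overlap duration divided by the combined duration of both spans.

    Assumption: this is a generic alignment measure, not necessarily the
    metric used by LAT-Bench.
    """
    # Overlap is zero when the two spans do not intersect.
    inter = max(0.0, min(pred.end, ref.end) - max(pred.start, ref.start))
    union = (pred.end - pred.start) + (ref.end - ref.start) - inter
    return inter / union if union > 0 else 0.0


# Example: a predicted span that starts 5 seconds late and ends 5 seconds late.
score = temporal_iou(Segment(5.0, 15.0), Segment(0.0, 10.0))
```

A score of 1.0 means the predicted span matches the reference exactly, and 0.0 means the spans do not overlap. A fixed timing error costs the same IoU regardless of where in the recording it occurs, so the measure applies equally to short clips and 30-minute inputs.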


Code & Models

Repositories

alanshaoTT/LAT-Audio-Repo (GitHub)
