FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

Xiquan Li; Xuenan Xu; Ziyang Ma; Wenxi Chen; Haolin He; Qiuqiang Kong; Xie Chen

arXiv:2604.01155 · cs.SD · April 2, 2026


TL;DR

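FineLAP is a training paradigm for contrastively pretrained audio-language models (e.g., CLAP) that improves both clip- and frame-level alignment by learning jointly from abundant clip-level captions and scarce frame-level annotations, supported by a new synthetic dataset, FineLAP-100k.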

Contribution

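The paper proposes a dual-stream sigmoid loss with a cluster-based sampling strategy, a decoupled audio projector on top of a self-supervised encoder, and FineLAP-100k, a large-scale synthetic sound event detection (SED) dataset that offsets the scarcity of temporally annotated audio.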

Findings

01 Achieves state-of-the-art results on multiple audio understanding tasks.

02 Demonstrates that coarse-grained (clip-level) and fine-grained (frame-level) alignment benefit each other.

03 Introduces FineLAP-100k, a large-scale synthetic sound event detection (SED) dataset.

Abstract

Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed…
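The idea of a dual-stream sigmoid loss can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a SigLIP-style pairwise sigmoid objective applied once to clip-caption pairs and once to frame-event pairs, then summed. The function names, tensor shapes, `scale`, `bias`, and `frame_weight` are all hypothetical, and the cluster-based sampling strategy is omitted because the abstract does not describe it.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def sigmoid_pair_loss(logits, labels):
    # Mean binary sigmoid loss over all pairs; labels are +1 (match) or -1 (non-match).
    return float(-np.mean(log_sigmoid(labels * logits)))


def l2_normalize(x, eps=1e-8):
    # Unit-normalize along the embedding axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def dual_stream_sigmoid_loss(clip_audio, clip_text, frame_audio, event_text,
                             frame_labels, scale=10.0, bias=-10.0, frame_weight=1.0):
    """Hypothetical combination of a clip-level and a frame-level sigmoid loss.

    clip_audio:   (B, D) one global embedding per audio clip
    clip_text:    (B, D) one embedding per caption, row i paired with clip i
    frame_audio:  (T, D) per-frame audio embeddings
    event_text:   (K, D) embeddings of K sound event phrases
    frame_labels: (T, K) binary matrix, 1 where event k is active in frame t
    """
    # Clip stream: only the diagonal (clip i, caption i) pairs are positives.
    clip_logits = scale * l2_normalize(clip_audio) @ l2_normalize(clip_text).T + bias
    clip_targets = 2.0 * np.eye(clip_audio.shape[0]) - 1.0
    clip_loss = sigmoid_pair_loss(clip_logits, clip_targets)

    # Frame stream: every (frame, event) pair is an independent binary decision.
    frame_logits = scale * l2_normalize(frame_audio) @ l2_normalize(event_text).T + bias
    frame_targets = 2.0 * np.asarray(frame_labels, dtype=float) - 1.0
    frame_loss = sigmoid_pair_loss(frame_logits, frame_targets)

    # Assumed weighting: a plain weighted sum of the two streams.
    return clip_loss + frame_weight * frame_loss
```

Because each pair is scored independently with a sigmoid rather than normalized with a softmax over the batch, the same loss form applies to both streams even though a frame can contain several active events at once.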


Code & Models

Models

AndreasXi/FineLAP (Hugging Face model; 198 downloads, 2 likes)
