FLAM: Frame-Wise Language-Audio Modeling

Yusong Wu; Christos Tsirigotis; Ke Chen; Cheng-Zhi Anna Huang; Aaron Courville; Oriol Nieto; Prem Seetharaman; Justin Salamon

arXiv:2505.05335 · cs.SD · June 10, 2025

TL;DR

FLAM is an open-vocabulary contrastive audio-language model that localizes specific sound events at the frame level. It removes the fixed-category restriction of traditional sound event detection models and gives finer-grained audio understanding than retrieval-oriented audio-language models.

Contribution

Introduces FLAM, a frame-wise, open-vocabulary audio-language model trained with a calibrated contrastive objective on large-scale, diverse data to improve sound event localization.

Findings

01 Significantly improves frame-wise sound event localization.

02 Maintains strong performance in global retrieval tasks.

03 Effective in open-vocabulary and out-of-distribution scenarios.

Abstract

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale…
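To make the idea of a frame-wise objective with logit adjustment more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: it assumes a sigmoid (binary cross-entropy) loss over every frame/caption pair and an additive per-caption bias as the adjustment term. The function names, the `scale` parameter, and the prior-based bias in `prior_to_bias` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def frame_text_logits(frame_emb, text_emb, scale, text_bias):
    """Score every audio frame against every caption.

    frame_emb: (B, T, D) per-frame audio embeddings
    text_emb:  (K, D)    caption embeddings
    scale:     float     similarity temperature (assumed hyperparameter)
    text_bias: (K,)      per-caption additive bias (assumed form of the
                         logit adjustment; not taken from the paper)
    returns:   (B, T, K) logits
    """
    f = l2_normalize(frame_emb)
    t = l2_normalize(text_emb)
    sim = np.einsum("btd,kd->btk", f, t)  # cosine similarity per frame
    return scale * sim + text_bias        # bias broadcasts over B and T

def frame_bce_loss(logits, labels):
    """Mean binary cross-entropy on logits, in a numerically stable form."""
    z = logits
    per_elem = np.maximum(z, 0) - z * labels + np.log1p(np.exp(-np.abs(z)))
    return per_elem.mean()

def prior_to_bias(prior, eps=1e-6):
    """Log-odds of a per-caption activity rate (hypothetical bias choice)."""
    p = np.clip(prior, eps, 1 - eps)
    return np.log(p / (1 - p))

# Toy usage with random embeddings and random frame labels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(2, 5, 8))    # 2 clips, 5 frames, 8-dim
captions = rng.normal(size=(3, 8))     # 3 candidate event descriptions
labels = (rng.random(size=(2, 5, 3)) < 0.2).astype(float)
bias = prior_to_bias(labels.mean(axis=(0, 1)))
logits = frame_text_logits(frames, captions, scale=10.0, text_bias=bias)
loss = frame_bce_loss(logits, labels)
```

In this sketch the bias shifts each caption's logits toward how often that event is active, so a rare event does not need a large similarity score just to overcome its low base rate. Whether FLAM computes its adjustment this way is not stated in the abstract.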

Code & Models

Models

🤗 kechenadobe/OpenFLAM (model, ♡ 3)

Videos

FLAM: Frame-Wise Language-Audio Modeling (SlidesLive)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis