FLAM: Frame-Wise Language-Audio Modeling
Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon

TL;DR
FLAM is an open-vocabulary contrastive audio-language model that localizes specific sound events at the frame level. Unlike traditional sound event detection models, it is not restricted to pre-defined categories, which enables fine-grained audio understanding for arbitrary text prompts.
Contribution
Introduces FLAM, a frame-wise, open-vocabulary audio-language model trained with a calibrated contrastive objective on large-scale, diverse data to improve sound event localization.
Findings
Significantly improves frame-wise sound event localization.
Maintains strong performance in global retrieval tasks.
Effective in open-vocabulary and out-of-distribution scenarios.
Abstract
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale…
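The abstract describes a calibrated frame-wise objective with logit adjustment but does not spell out its form. The following is a minimal sketch of one way such an objective could look, not the authors' implementation. It assumes cosine similarity between per-frame audio embeddings and text embeddings, a sigmoid binary cross-entropy loss per (text, frame) pair, and a per-prompt bias set to the log-odds of how often each event is active. The function names, the `scale` value, and the choice of bias are all hypothetical.

```python
import numpy as np

def prior_bias(labels, eps=1e-6):
    """Hypothetical logit adjustment: log-odds of each event's frame-level prior.
    labels: (K, T) binary matrix, 1 where text prompt k is active at frame t."""
    p = np.clip(labels.mean(axis=1), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def framewise_logits(audio_frames, text_emb, scale=10.0, bias=None):
    """Scaled cosine similarity between every text prompt and every audio frame.
    audio_frames: (T, D), text_emb: (K, D). Returns (K, T) logits."""
    a = audio_frames / np.linalg.norm(audio_frames, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = scale * (t @ a.T)
    if bias is not None:
        logits = logits + bias[:, None]  # one bias per text prompt (assumption)
    return logits

def framewise_bce(logits, labels):
    """Numerically stable sigmoid binary cross-entropy, averaged over all pairs."""
    loss = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# Toy usage: 3 text prompts, 8 audio frames, 4-dimensional embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))
texts = rng.normal(size=(3, 4))
labels = np.zeros((3, 8))
labels[0, :4] = 1.0  # prompt 0 is active in the first four frames

logits = framewise_logits(frames, texts, bias=prior_bias(labels))
loss = framewise_bce(logits, labels)
probs = 1.0 / (1.0 + np.exp(-logits))  # per-frame event probabilities
```

Because each (text, frame) pair is scored independently with a sigmoid, the resulting probabilities can be thresholded per frame to decide when a prompted event occurs, which is the kind of output frame-wise localization requires.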
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
