Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

TL;DR
This study evaluates masked autoencoders (MAE) for fine-grained bioacoustic species classification and finds that pretraining on diverse general audio data generally outperforms domain-specific pretraining in limited-data scenarios.
Contribution
The paper systematically analyzes MAE pretraining in bioacoustics, showing that data scale outweighs domain-specific tuning for moderately sized datasets.
Findings
Pretraining on diverse general audio data yields better transfer performance.
Additional domain-specific masked reconstruction pretraining offers limited benefit and can even hurt performance.
Selective data filtering provides negligible gains when data is limited.
Abstract
Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer…
