No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings
Chenggang Chen, Zhiyu Yang

TL;DR
This study benchmarks 11 deep learning models on bioacoustic tasks and finds that audio-pretrained models used without fine-tuning underperform even a fine-tuned AlexNet, underscoring the importance of fine-tuning for useful embeddings.
Contribution
It benchmarks the embeddings of 11 deep learning models on shared bioacoustic tasks, using dimensionality reduction and clustering, and shows that audio pretraining alone does not guarantee good performance without fine-tuning.
Findings
Audio-pretrained models without fine-tuning underperform even a fine-tuned AlexNet.
ResNet separates background sounds from labeled sounds, whereas audio-pretrained models fail to do so with or without fine-tuning.
Audio-pretrained models outperform other models when fewer background sounds are included during fine-tuning.
Abstract
Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings' dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning even underperform fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, but ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study…
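The evaluation described in the abstract (reduce the dimensionality of learned embeddings, then judge them through clustering) can be illustrated with a minimal sketch. This summary does not name the reduction method, the clustering algorithm, or the metric the authors used, so PCA, k-means with farthest-point initialization, and cluster purity are stand-in assumptions here, and the embeddings are synthetic rather than taken from any of the benchmarked models.

```python
import numpy as np

def pca_reduce(embeddings, n_components=2):
    """Project embeddings onto their top principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assign

def purity(assign, labels):
    """Fraction of points that carry the majority label of their cluster."""
    total = 0
    for j in np.unique(assign):
        _, counts = np.unique(labels[assign == j], return_counts=True)
        total += counts.max()
    return total / len(labels)

# Synthetic stand-in for model embeddings: 3 sound classes, 16 dimensions.
# (Hypothetical data; a real run would load embeddings from a model.)
rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 60, 16, 3
class_centers = 10.0 * np.eye(n_classes, dim)
labels = np.repeat(np.arange(n_classes), n_per_class)
embeddings = class_centers[labels] + rng.normal(scale=0.5, size=(len(labels), dim))

reduced = pca_reduce(embeddings, n_components=2)
assign = kmeans(reduced, k=n_classes)
score = purity(assign, labels)
print(f"reduced shape: {reduced.shape}, cluster purity: {score:.2f}")
```

A higher purity means the clusters found in the reduced embedding space line up more closely with the labeled sound classes. Comparing this kind of score across models, with and without fine-tuning, is the sort of comparison the benchmark makes, although the authors' exact procedure may differ.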
