Distilling Spectrograms into Tokens: Fast and Lightweight Bioacoustic Classification for BirdCLEF+ 2025
Anthony Miyaguchi, Murilo Gustineli, and Adrian Cheung

TL;DR
This paper presents a fast, lightweight bioacoustic classification pipeline for BirdCLEF+ 2025, combining optimized pre-trained models and a novel spectrogram tokenization method to meet strict inference time constraints.
Contribution
The paper introduces Spectrogram Token Skip-Gram (STSG), a new sequence modeling approach using spectrogram tokens and static embeddings for efficient bioacoustic classification.
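To make the skip-gram idea concrete, here is a minimal sketch of how (center, context) training pairs can be drawn from a sequence of discrete tokens. It assumes the spectrogram has already been converted into one integer token per frame. The token values, the `window` parameter, and the `skipgram_pairs` helper are hypothetical names chosen for illustration, and the summary does not say how the authors tokenize audio or train their embeddings, so this is generic skip-gram pair generation rather than the paper's implementation.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[int], window: int = 2) -> List[Tuple[int, int]]:
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the bounds of the sequence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical token ids for five consecutive spectrogram frames.
frame_tokens = [7, 7, 3, 9, 3]
pairs = skipgram_pairs(frame_tokens, window=1)
```

Pairs like these are what a skip-gram model consumes to learn one static embedding per token. Averaging the embeddings of a clip's tokens is one common way to obtain a fixed-size feature vector for a lightweight classifier, though whether the authors pool this way is not stated here.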
Findings
TFLite optimization achieved a nearly 10x inference speedup on the Perch model.
The STSG method offered a viable, fast classification approach, with ROC-AUC scores above 0.5.
Optimized pre-trained models achieved competitive scores within the 90-minute CPU inference limit.
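For readers less familiar with the metric behind these scores, the sketch below computes ROC-AUC for a single binary label as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting as half. The `roc_auc` helper and the sample labels and scores are illustrative assumptions. How the leaderboard aggregates the metric across species is not described in this summary, so this covers only the per-class quantity.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for one hypothetical species.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A classifier that assigns the same score to every clip lands at exactly 0.5 under this definition, which is why 0.5 serves as the chance-level reference point for the scores reported above.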
Abstract
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram…
