TL;DR
Perch 2.0 is a pre-trained bioacoustics model trained on multi-taxa data. It achieves state-of-the-art results on the BirdSet and BEANS benchmarks and transfers well to marine tasks despite having almost no marine training data.
Contribution
It extends Perch from avian-only to multi-taxa training, combining self-distillation through a prototype-learning classifier with a new source-prediction criterion to improve bioacoustic classification and transfer learning.
Findings
State-of-the-art results on BirdSet and BEANS benchmarks
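Outperforms specialized marine models on marine transfer learning tasks despite almost no marine training data
Fine-grained species classification appears to be a particularly robust pre-training task for bioacoustics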
Abstract
Perch is a performant pre-trained model for bioacoustics. It was trained in a supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species and strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.
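The transfer-learning use of embeddings mentioned in the abstract is commonly evaluated with a linear probe: the pre-trained model stays frozen and only a linear classifier is trained on its embeddings. The following is a minimal sketch of that idea, not the authors' evaluation code. The embeddings are synthetic stand-ins generated by a hypothetical `fake_embedding` helper rather than real Perch 2.0 outputs, and the names `train_linear_probe` and `predict` are illustrative.

```python
import math
import random


def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression probe on frozen embeddings."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            err = prob - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step; the embeddings are never updated.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one embedding."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# ASSUMPTION: synthetic embeddings stand in for the output of a frozen model.
rng = random.Random(0)


def fake_embedding(label, dim=8):
    center = 1.0 if label == 1 else -1.0
    return [center + rng.gauss(0.0, 0.5) for _ in range(dim)]


train_labels = [i % 2 for i in range(40)]
train_embeddings = [fake_embedding(y) for y in train_labels]
weights, bias = train_linear_probe(train_embeddings, train_labels)

test_labels = [i % 2 for i in range(20)]
test_embeddings = [fake_embedding(y) for y in test_labels]
correct = sum(predict(weights, bias, x) == y
              for x, y in zip(test_embeddings, test_labels))
accuracy = correct / len(test_labels)
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

In practice the synthetic vectors would be replaced by embeddings computed once from labeled audio windows, which keeps adaptation to a new task cheap because only the probe is trained.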
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The paper combines supervised training, prototype-based distillation, and auxiliary objectives in a clear, effective design.
- Experiments cover a wide range of benchmarks and are technically solid.
- The model transfers well across domains while staying compact and efficient.
- Strong performance under linear probing shows the embeddings are general and practical to use.
- The model architecture is optimized to be employed in real-world systems, so as to be as light as possible.
- Unclear contribution of components to performance: The paper introduces several methodological components (e.g., multi-source mixup, self-distillation, and an auxiliary source-prediction loss). However, their individual contributions are not clearly isolated, as the paper does not provide controlled ablation studies. In particular, the role of the windowing strategy and the handling of label noise across heterogeneous sources remains insufficiently explained. This raises concerns regarding rep
- The paper is well-written and easy to follow.
- The model achieves state-of-the-art results on multiple datasets.
While the developed model shows strong performance, the question remains: what contributes to its strong performance? As multiple changes were made compared with the BirdSet and Perch 1.0 baselines, it is hard to assess the importance of each individual change. Most importantly, it is unclear how much the additional training data contributes to the performance increase relative to the architectural changes and auxiliary losses. Ablation studies could help clarify this. This is especially importa
- **Clear presentation**: The paper is well-written and easy to follow.
- **Comprehensive evaluation framework**: The inclusion of different model selection tasks in the evaluation is nice, as it helps to identify both strengths and limitations of the model.
- **Pragmatic focus on supervised learning**: The decision to focus on supervised learning rather than following the current trend toward self-supervised methods is commendable. This work demonstrates that supervised approaches remain compet
**Unclear novelty and insufficient differentiation from prior work** The authors claim several contributions, including a novel mixup procedure, a self-distillation process, and a self-supervised auxiliary loss. However, the paper lacks clarity in distinguishing what constitutes genuinely novel contributions versus adaptations of existing techniques. For example, while the authors propose generalizing mixup to more than two components, they do not adequately discuss related work that already ex
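The generalization of mixup to more than two components that this review refers to can be illustrated with a generic sketch. This is not the authors' procedure: drawing the mixing weights from a symmetric Dirichlet distribution is an assumption made here for illustration, and the function names `sample_dirichlet` and `mixup_many` are hypothetical. The sketch returns both a weighted (soft) target and a multi-hot target so the two labeling options can be compared.

```python
import random


def sample_dirichlet(rng, k, alpha=1.0):
    """Draw k mixing weights from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mixup_many(signals, label_lists, num_classes, rng, alpha=1.0):
    """Mix any number of equal-length signals and combine their labels.

    ASSUMPTION: each entry of label_lists holds distinct class indices
    for the corresponding signal.
    """
    weights = sample_dirichlet(rng, len(signals), alpha)
    length = len(signals[0])
    # Convex combination of the signals, sample by sample.
    mixed = [sum(w * s[t] for w, s in zip(weights, signals))
             for t in range(length)]
    # Soft target: each class receives the total weight of the signals containing it.
    soft = [0.0] * num_classes
    for w, labels in zip(weights, label_lists):
        for c in labels:
            soft[c] += w
    # Multi-hot target: a class is marked present if any mixed signal contains it.
    multi_hot = [1.0 if v > 0.0 else 0.0 for v in soft]
    return mixed, soft, multi_hot, weights


rng = random.Random(0)
signals = [
    [0.0, 1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.0, -1.0, 0.0],
]
label_lists = [[0], [1], [0, 3]]  # toy class indices, purely illustrative
mixed, soft, multi_hot, weights = mixup_many(signals, label_lists, 4, rng)
print(weights, soft, multi_hot)
```

With exactly two signals and a soft target, this reduces to a form of standard mixup. The difference between the two targets only matters when the mixing weights are far from uniform.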
1. *Clarity and motivation*: The paper is well-written and very easy to follow. The work is well-motivated by addressing real-world challenges faced by practitioners, such as the need for strong, generalizable embeddings from smaller models that do not require extensive fine-tuning.
2. *Methodological combination*: The work combines existing techniques (self-distillation, source prediction, prototype learning) into a single training framework. This combination is well-suited to the problem of fi
While the proposed method combination is interesting and the results are strong, the paper is limited by a lack of empirical validation. The core issue is an absence of ablation studies, which makes it impossible to attribute the performance gains to the specific contributions claimed by the authors (method-based, data-based, etc.).
**1. Confounded contributions and lack of ablations:** The core weakness is that the paper simultaneously introduces multiple changes to the previous model (a larg
