NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

David Robinson; Marius Miron; Masato Hagiwara; Benno Weck; Sara Keen; Milad Alizadeh; Gagan Narula; Matthieu Geist; Olivier Pietquin

arXiv:2411.07186·cs.SD·July 1, 2025·5 cites




TL;DR

NatureLM-audio is an audio-language foundation model built specifically for bioacoustics. It improves zero-shot classification and generalizes across diverse animal vocalization tasks, supporting conservation and biodiversity monitoring.

Contribution

It introduces the first bioacoustics-specific audio-language foundation model trained on curated data, demonstrating transfer learning from speech and music to bioacoustics and establishing new state-of-the-art results.

Findings

01. Sets new state-of-the-art on bioacoustics tasks

02. Shows effective transfer from speech and music models

03. Achieves promising generalization to unseen species
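One way to picture zero-shot classification with a model that answers in free text is to map each generated answer onto a fixed list of candidate species names and then score accuracy. The sketch below is an illustration under stated assumptions, not the paper's evaluation protocol: the helper names (`normalize`, `match_label`, `zero_shot_accuracy`) are hypothetical, and plain string similarity from Python's standard library stands in for whatever matching rule a real benchmark would use.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect matching."""
    return " ".join(text.lower().split())


def match_label(generated: str, candidates: list[str]) -> str:
    """Return the candidate label most similar to the generated text.

    Assumption: character-level similarity is a reasonable stand-in for
    a benchmark's real answer-matching rule.
    """
    target = normalize(generated)
    return max(
        candidates,
        key=lambda label: SequenceMatcher(None, target, normalize(label)).ratio(),
    )


def zero_shot_accuracy(
    outputs: list[str], truths: list[str], candidates: list[str]
) -> float:
    """Fraction of generated answers whose matched label equals the true label."""
    hits = sum(
        match_label(out, candidates) == truth
        for out, truth in zip(outputs, truths)
    )
    return hits / len(truths)


# Hypothetical label set and model outputs, for illustration only.
labels = ["Barn Owl", "Common Raven", "Humpback Whale"]
generated_answers = ["barn owl.", "humpback whale", "common raven"]
true_labels = ["Barn Owl", "Humpback Whale", "Barn Owl"]
accuracy = zero_shot_accuracy(generated_answers, true_labels, labels)
```

In this toy run the first two answers match their true labels and the third does not, so `accuracy` is two thirds. None of the species or outputs above come from the paper.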

Abstract

Large language models (LLMs) prompted with text and audio have achieved state-of-the-art performance across various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, their potential has yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior -- tasks that are crucial for conservation, biodiversity monitoring, and animal behavior studies. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music, designed to address the field's limited availability of annotated data. We demonstrate successful transfer of learned…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

An incredible collection of datasets, and careful curation. A lot of ancillary code for use of the data in various learning tasks.

Weaknesses

Unclear from the presentation if the authors intend to make the dataset widely available, and under what license.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. Addresses an important topic, both for the ML research community (since audio, and especially computational bioacoustics, is a hard problem) and for society. 2. Collects a comprehensive training dataset and extends an existing evaluation benchmark with additional tasks. 3. The performance improvements compared to a model not trained on bioacoustics data (SALMONN) support the claim that this domain needs its own foundation model.

Weaknesses

1. Soundness of results: Your presented results show only a minor improvement compared to BioLingual (which also presents zero-shot results on BEANS; the numbers sometimes differ, why?), so what is the benefit of your approach and, more particularly, does integrating an LLM have a benefit? Or is it the different training dataset? Or the audio encoder (BEATs vs. HTS-AT)? 2. No further details for replication of the experiments are given, e.g. pretrained models or the list of species which were hold ou

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. The introduction of NatureLM, the first audio-language model specifically designed for bioacoustics, represents a promising new direction for incorporating language models into biodiversity monitoring. 2. The development of the BEANS-Zero benchmark extends the original BEANS benchmark by introducing new tasks, such as call-type prediction, life-stage classification, individual counting, and open-ended audio captioning. These additions have the potential to advance bioacoustics research and e

Weaknesses

1. Incorrect Terminology 1.1. The introduction describes BioLingual as self-supervised; however, the supervision is derived from text generated based on class labels. I recommend referring to it as supervised learning with language-based supervision for greater clarity and accuracy. 2.1. Both BioLingual and AVES are described in the paper as foundation models, but this classification may be misleading. BioLingual and AVES are trained on datasets with less than 2 million samples, while models

Code & Models

Models

🤗
EarthSpeciesProject/NatureLM-audio
model· 495 dl· ♡ 30

Datasets

EarthSpeciesProject/NatureLM-audio-training
dataset· 2.2k dl

Videos

NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics· slideslive

Taxonomy

Topics: Animal Vocal Communication and Behavior · Music and Audio Processing · Diverse Musicological Studies