Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification
Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura

TL;DR
This paper introduces a self-supervised learning approach using Masked Autoencoders with Transformers to improve call segmentation, classification, and caller identification in noisy, low-resource marmoset vocalization data.
Contribution
It demonstrates that MAE-pretrained Transformers outperform CNNs in modeling complex, variable marmoset vocal patterns, especially in low-resource, noisy environments.
Findings
MAE pretraining enhances model stability and generalization.
Transformers outperform CNNs in call segmentation and classification.
Self-supervised learning is effective for non-human vocal communication modeling.
Abstract
The marmoset, a highly vocal primate, is a key model for studying social-communicative behavior. Unlike human speech, marmoset vocalizations are less structured, highly variable, and recorded in noisy, low-resource conditions. Learning marmoset communication requires joint call segmentation, classification, and caller identification -- challenging domain tasks. Previous CNNs handle local patterns but struggle with long-range temporal structure. We applied Transformers using self-attention for global dependencies. However, Transformers show overfitting and instability on small, noisy annotated datasets. To address this, we pretrain Transformers with MAE -- a self-supervised method reconstructing masked segments from hundreds of hours of unannotated marmoset recordings. The pretraining improved stability and generalization. Results show MAE-pretrained Transformers outperform CNNs,…
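The masked-reconstruction objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The spectrogram shape, the patch length of 16 frames, the 75% mask ratio, and the helper names `patchify`, `random_mask`, and `masked_mse` are all assumptions made for illustration, and a trivial mean-patch predictor stands in for the Transformer encoder and decoder.

```python
import numpy as np


def patchify(spec, patch_len):
    """Split a (freq, time) spectrogram into non-overlapping time patches.

    Returns an array of shape (n_patches, freq * patch_len).
    Trailing frames that do not fill a whole patch are dropped.
    """
    n_freq, n_time = spec.shape
    n_patches = n_time // patch_len
    trimmed = spec[:, : n_patches * patch_len]
    blocks = trimmed.reshape(n_freq, n_patches, patch_len)
    return blocks.transpose(1, 0, 2).reshape(n_patches, -1)


def random_mask(n_patches, mask_ratio, rng):
    """Return a boolean vector that is True for the patches to hide."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()


rng = np.random.default_rng(0)

# Hypothetical input: 64 frequency bins x 400 frames of random noise
# standing in for an unannotated recording.
spec = rng.standard_normal((64, 400))

patches = patchify(spec, patch_len=16)
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed ratio

# Only the visible patches would be passed to the encoder.
visible = patches[~mask]

# Placeholder "model": predict the mean visible patch everywhere.
# A real MAE would use a Transformer encoder and a lightweight decoder.
pred = np.tile(visible.mean(axis=0), (len(patches), 1))

loss = masked_mse(pred, patches, mask)
print(patches.shape, int(mask.sum()), visible.shape, float(loss))
```

Because the loss is computed only on hidden patches, the model cannot lower it by copying its input. It has to infer the missing audio from the surrounding context, and this objective requires no call or caller labels.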
Taxonomy
Topics: Plant Reproductive Biology · Plant Physiology and Cultivation Studies · Animal Vocal Communication and Behavior
Methods: Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Attention Is All You Need · Multi-Head Attention · Softmax
