Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?
Eklavya Sarkar, Mathew Magimai.-Doss

TL;DR
This study shows that self-supervised neural representations pre-trained on human speech can distinguish individual Marmoset callers without fine-tuning, indicating that these representations transfer across domains to bio-acoustic analysis.
Contribution
The paper shows that SSL models pre-trained on human speech can be applied directly to bio-acoustic signals for caller identification, without fine-tuning.
Findings
SSL embeddings encode individual caller information
Models successfully distinguish Marmoset callers without fine-tuning
SSL representations transfer effectively from human speech to bio-acoustic signals
Abstract
Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively…
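The idea of using frozen embeddings for caller discrimination can be illustrated with a minimal sketch. This is not the authors' pipeline, which this summary does not describe in detail. It assumes that some frozen SSL model has already produced frame-level feature vectors for each vocalization; here those features are replaced by synthetic Gaussian data, and the caller names, dimensions, and the nearest-centroid classifier with cosine similarity are all illustrative assumptions.

```python
import random
from statistics import fmean


def mean_pool(frames):
    """Average frame-level vectors (T x D) into one utterance-level vector (D)."""
    return [fmean(column) for column in zip(*frames)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def fit_centroids(embeddings, labels):
    """Compute one mean embedding per caller label."""
    by_caller = {}
    for emb, label in zip(embeddings, labels):
        by_caller.setdefault(label, []).append(emb)
    return {label: mean_pool(embs) for label, embs in by_caller.items()}


def predict(centroids, emb):
    """Return the caller whose centroid is most cosine-similar to emb."""
    return max(centroids, key=lambda label: cosine(centroids[label], emb))


def synthetic_frames(rng, caller_mean, n_frames=20, noise=0.3):
    """PLACEHOLDER for frozen SSL features: Gaussian noise around a per-caller mean."""
    return [[m + rng.gauss(0.0, noise) for m in caller_mean] for _ in range(n_frames)]


rng = random.Random(0)
dim = 8  # assumed toy dimensionality; real SSL embeddings are much larger

# Hypothetical callers; real data would come from labelled vocalizations.
caller_means = {name: [rng.gauss(0.0, 1.0) for _ in range(dim)]
                for name in ("caller_A", "caller_B", "caller_C")}


def make_split(n_per_caller):
    """Build utterance-level embeddings and labels for every hypothetical caller."""
    xs, ys = [], []
    for name, mean in caller_means.items():
        for _ in range(n_per_caller):
            xs.append(mean_pool(synthetic_frames(rng, mean)))
            ys.append(name)
    return xs, ys


train_x, train_y = make_split(10)
test_x, test_y = make_split(5)

# No model weights are updated: only per-caller centroids are computed.
centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, x) for x in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(f"held-out caller accuracy on synthetic data: {accuracy:.2f}")
```

The point of the sketch is the division of labour: the feature extractor stays frozen, and caller identity is read out with a very simple method on top of pooled embeddings. With real data, `synthetic_frames` would be replaced by features extracted from actual recordings.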
Taxonomy
Topics: Animal Vocal Communication and Behavior · Speech and Audio Processing · Speech Recognition and Synthesis
