Towards Leveraging Sequential Structure in Animal Vocalizations
Eklavya Sarkar, Mathew Magimai.-Doss

TL;DR
This study explores whether vector-quantized token sequences derived from self-supervised speech models can capture the sequential structure in animal vocalizations and use it to improve call-type and caller classification.
Contribution
It introduces a novel approach using vector quantization of speech model representations to encode temporal order in animal calls, which enhances bioacoustic analysis.
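To illustrate why quantizing frames keeps temporal order while averaging does not, here is a minimal sketch. It assumes a fixed toy codebook and nearest-neighbour assignment; the names `quantize`, `codebook`, and `frames` are hypothetical, and the paper's actual quantizers (learned vector quantization and Gumbel-Softmax vector quantization over speech model representations) are not reproduced here.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame (row) to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code: shape (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy codebook (assumption): 4 codes in a 4-dimensional feature space.
codebook = np.eye(4)

# Toy "vocalization": frames lying close to codes 2, 2, 0, 3, 1 in that order.
order = np.array([2, 2, 0, 3, 1])
frames = codebook[order] + 0.05

tokens = quantize(frames, codebook)                 # order-preserving token sequence
reversed_tokens = quantize(frames[::-1], codebook)  # same frames, played backwards

# Temporal averaging gives the same vector for both orderings,
# whereas the two token sequences differ.
same_mean = np.allclose(frames.mean(axis=0), frames[::-1].mean(axis=0))
print(tokens.tolist(), reversed_tokens.tolist(), same_mean)
```

In this toy case the mean-pooled feature cannot tell the call from its time-reversed copy, while the token sequences can.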
Findings
Token sequences can discriminate call-types and callers.
Sequence-based features improve classification performance.
Vector-quantized tokens hold promise for bioacoustic analysis.
Abstract
Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-Softmax vector quantization of extracted self-supervised speech model representations, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using k-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller…
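The sequence classification setup named in the abstract (k-Nearest Neighbour with Levenshtein distance over token sequences) can be sketched as follows. This is an illustrative sketch only: the function names, the toy token sequences, the labels `call_a` and `call_b`, and the choice `k=3` are assumptions, not the authors' implementation or data.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences closest to the query."""
    ranked = sorted(range(len(train_seqs)),
                    key=lambda i: levenshtein(query, train_seqs[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical token sequences for two made-up call types.
train_seqs = [[1, 1, 2, 3], [1, 2, 2, 3], [1, 1, 2, 3, 3],
              [4, 4, 5], [4, 5, 5], [4, 4, 5, 5]]
train_labels = ["call_a", "call_a", "call_a", "call_b", "call_b", "call_b"]

print(knn_predict([1, 2, 3], train_seqs, train_labels))
print(knn_predict([4, 5], train_seqs, train_labels))
```

Levenshtein distance is a natural fit here because token sequences from different vocalizations generally differ in length, and the distance depends on the order of the tokens rather than only on their counts.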
Taxonomy
Topics: Animal Vocal Communication and Behavior · Neuroendocrine regulation and behavior · Marine animal studies overview
