A Theoretical Framework for Acoustic Neighbor Embeddings
Woojay Jeon

TL;DR
This paper introduces a probabilistic theoretical framework for acoustic neighbor embeddings that supports their principled interpretation and their application to phonetic similarity problems, with empirical validation across diverse tasks including word classification and dialect clustering.
Contribution
It provides a novel probabilistic interpretation of acoustic neighbor embeddings and demonstrates their effectiveness in phonetic tasks, supported by theoretical and empirical evidence.
Findings
Nearest-neighbor search matches finite-state transducer (FST) accuracy for large vocabularies
Embedding distances closely approximate phone edit distances in OOV word recovery
Clustering hierarchies align with human listening experiments
Abstract
This paper provides a theoretical framework for interpreting acoustic neighbor embeddings, which are representations of the phonetic content of variable-width audio or text in a fixed-dimensional embedding space. A probabilistic interpretation of the distances between embeddings is proposed, based on a general quantitative definition of phonetic similarity between words. This provides us with a framework for understanding and applying the embeddings in a principled manner. Theoretical and empirical evidence to support an approximation of uniform cluster-wise isotropy is shown, which allows us to reduce the distances to simple Euclidean distances. Four experiments that validate the framework and demonstrate how it can be applied to diverse problems are described. Nearest-neighbor search between audio and text embeddings can give isolated word classification accuracy that is identical to that…
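The nearest-neighbor classification mentioned in the abstract can be illustrated with a minimal sketch. It assumes that audio and text embeddings already live in the same fixed-dimensional space and that plain Euclidean distance is an adequate measure, as the abstract suggests. The function name `nearest_word`, the toy vocabulary, and the two-dimensional vectors are hypothetical and are not taken from the paper; real embeddings would come from trained audio and text encoders.

```python
import math

def nearest_word(audio_embedding, text_embeddings):
    """Return (word, distance) for the text embedding closest to the audio embedding.

    audio_embedding: sequence of floats (hypothetical encoder output).
    text_embeddings: dict mapping each vocabulary word to its embedding.
    """
    return min(
        ((word, math.dist(audio_embedding, vec)) for word, vec in text_embeddings.items()),
        key=lambda pair: pair[1],  # compare candidates by Euclidean distance
    )

# Toy 2-D embeddings, made up for illustration only.
text_embeddings = {
    "cat": (0.0, 0.0),
    "cap": (1.0, 0.0),
    "dog": (5.0, 5.0),
}
audio_embedding = (0.2, 0.1)

word, dist = nearest_word(audio_embedding, text_embeddings)
print(word)  # cat
```

A linear scan like this is only practical for small vocabularies; for the large vocabularies the paper targets, an indexed or approximate nearest-neighbor search would be the more likely choice.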
Taxonomy
Topics: Speech and Audio Processing
