TL;DR
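TristouNet is an LSTM-based neural network trained with triplet loss that embeds speech sequences into a fixed-dimensional Euclidean space, where embeddings can be compared directly by Euclidean distance. On short (500 ms to 5 s) speech turn comparison and on speaker change detection, it reports significant improvements over state-of-the-art techniques.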
Contribution
It presents a novel LSTM-based architecture trained with triplet loss for speaker turn embedding, enhancing speaker comparison and change detection performance.
Findings
Significant improvement over the state of the art in speaker comparison.
Effective on short speech turns (between 500 ms and 5 s).
Significant improvement over the state of the art in speaker change detection.
Abstract
TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the Euclidean distance, for speaker comparison purposes. Experiments on short (between 500 ms and 5 s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.
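The triplet loss idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common hinge formulation on plain Euclidean distances; the abstract does not give the paper's exact formulation, so the margin value, the decision threshold, the function names, and the toy 2-dimensional embeddings below are all hypothetical and chosen only for illustration. No LSTM is involved here: the vectors stand in for embeddings a trained network would produce.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed formulation, margin is illustrative).

    The loss is zero once the anchor is closer to the positive (same speaker)
    than to the negative (different speaker) by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def same_speaker(u, v, threshold=7.0):
    """Compare two embeddings by thresholding their distance.

    The threshold is a hypothetical value for the toy vectors below; in
    practice it would be tuned on held-out data.
    """
    return euclidean(u, v) < threshold


# Toy embeddings (hypothetical): anchor and positive share a speaker.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the anchor
negative = [6.0, 8.0]   # distance 10 from the anchor

well_separated = triplet_loss(anchor, positive, negative)
violated = triplet_loss(anchor, negative, positive)
```

With these toy values the first triplet already satisfies the margin, so its loss is zero, while swapping the positive and negative yields a positive loss that training would push down. Once embeddings are trained this way, comparing two speech turns reduces to a single distance computation, as in `same_speaker`.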
