TristouNet: Triplet Loss for Speaker Turn Embedding

Hervé Bredin

arXiv:1609.04301·cs.SD·April 12, 2017

TL;DR

TristouNet introduces a neural network with triplet loss for embedding speech sequences into a Euclidean space, enabling effective speaker comparison and change detection with improved accuracy over existing methods.

Contribution

The paper presents an LSTM-based architecture, trained with triplet loss, that embeds speaker turns for speaker comparison and speaker change detection.

Findings

01 Significant improvement over state-of-the-art in speaker comparison.

02 Effective for short speech turn analysis.

03 Demonstrates robustness in speaker change detection.

Abstract

TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly using the Euclidean distance for speaker comparison. Experiments on short (between 500 ms and 5 s) speech turn comparison and on speaker change detection show that TristouNet brings significant improvements over current state-of-the-art techniques for both tasks.
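To make the triplet loss idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes the common hinge form of the loss on plain Euclidean distances. The function names, the margin value of 0.2, and the toy 2-D vectors are illustrative assumptions, not taken from the paper, whose exact formulation may differ (for instance in the margin or in using squared distances).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, hypothetical margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings, made up for illustration only.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # stands in for a turn from the same speaker
negative = [3.0, 4.0]  # stands in for a turn from a different speaker

# Well-separated triplet: the constraint is already satisfied.
easy = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the constraint, so the loss is positive.
hard = triplet_loss(anchor, negative, positive)
```

In a trained model the vectors would be the network's outputs for three speech sequences. Once training has pushed same-speaker sequences together and different-speaker sequences apart, comparing two turns reduces to calling `euclidean` on their embeddings.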
