Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss

Guanlong Zhao; Quan Wang; Han Lu; Yiling Huang; Ignacio Lopez Moreno

arXiv:2211.06482·eess.AS·December 6, 2022

Open Access

TL;DR

This paper introduces a token-level training loss for Transformer-Transducer based speaker change detection, significantly enhancing accuracy by focusing on speaker change errors during training.

Contribution

It proposes a novel token-based training strategy with a custom edit-distance algorithm to improve speaker change detection performance.

Findings

01 Significant performance improvements on real-world datasets.

02 Effective reduction in false accept and false reject rates.

03 Enhanced evaluation metrics aligned with commercial needs.

Abstract

In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy. To mitigate this issue, we use a customized edit-distance algorithm to estimate the token-level SCD false accept (FA) and false reject (FR) rates during training and optimize model parameters to minimize a weighted combination of the FA and FR, focusing the model on accurately predicting speaker changes. We also propose a set of evaluation metrics that align better with commercial use cases. Experiments on a group of challenging real-world datasets show that the proposed training method can significantly improve…
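To make the token-level idea concrete, here is a minimal sketch of how speaker-change false accepts and false rejects could be counted from an edit-distance alignment and combined into a weighted score. It is not the authors' implementation. The token name `<st>`, the rule that a speaker-change token may not be substituted for an ordinary word, the normalization by reference length, and the function names `count_fa_fr` and `weighted_scd_error` are all assumptions made for illustration.

```python
from typing import List, Tuple

ST = "<st>"  # hypothetical speaker-change token; the name is an assumption


def _sub_cost(r: str, h: str) -> float:
    """Substitution cost; ST <-> word substitutions are forbidden (assumption)."""
    if r == h:
        return 0
    if (r == ST) != (h == ST):
        return float("inf")
    return 1


def count_fa_fr(ref: List[str], hyp: List[str]) -> Tuple[int, int]:
    """Align hyp to ref by edit distance; return (false_accepts, false_rejects).

    A false accept is an ST token inserted in the hypothesis; a false reject
    is a reference ST token that was deleted.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimum edit cost of aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + _sub_cost(ref[i - 1], hyp[j - 1]),
                cost[i - 1][j] + 1,  # delete a reference token
                cost[i][j - 1] + 1,  # insert a hypothesis token
            )
    # Backtrace one optimal path and count ST insertions and deletions.
    fa = fr = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = _sub_cost(ref[i - 1], hyp[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + step:
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            fr += ref[i - 1] == ST
            i -= 1
        else:
            fa += hyp[j - 1] == ST
            j -= 1
    return fa, fr


def weighted_scd_error(ref: List[str], hyp: List[str],
                       fa_weight: float = 1.0, fr_weight: float = 1.0) -> float:
    """Weighted sum of FA and FR rates, normalized by reference length (assumption)."""
    fa, fr = count_fa_fr(ref, hyp)
    denom = max(1, len(ref))
    return fa_weight * fa / denom + fr_weight * fr / denom
```

This sketch scores a single reference/hypothesis pair and is not differentiable on its own. When several alignments share the minimum cost, the backtrace picks one of them, so the counts can depend on tie-breaking.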

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: ALIGN · Feedback Alignment