Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation

Pu Wang; Hugo Van hamme

arXiv:2411.17846 · eess.AS · November 28, 2024 · IPAS

TL;DR

This paper introduces the Disentangled-Transformer, an explainable end-to-end speech recognition model that separates speech content from speaker traits, yielding a clear speaker identity for speaker diarization while improving ASR performance.

Contribution

The study presents a transformer-based ASR model that explicitly disentangles its internal representations into sub-embeddings for speech content and speaker traits, based on varying temporal resolutions, which makes the representations more interpretable.

Findings

01. Effective separation of speaker identity from speech content.

02. Improved ASR performance with disentangled representations.

03. Enhanced interpretability of internal model representations.

Abstract

End-to-end transformer-based automatic speech recognition (ASR) systems often capture multiple speech traits in their learned representations that are highly entangled, leading to a lack of interpretability. In this study, we propose the explainable Disentangled-Transformer, which disentangles the internal representations into sub-embeddings with explicit content and speaker traits based on varying temporal resolutions. Experimental results show that the proposed Disentangled-Transformer produces a clear speaker identity, separated from the speech content, for speaker diarization while improving ASR performance.
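The idea of sub-embeddings at different temporal resolutions can be illustrated with a toy sketch. This is not the authors' architecture: the function name `split_sub_embeddings`, the fixed dimension split, and the use of window averaging to obtain a coarser resolution are all assumptions made for illustration. The sketch keeps one part of each frame vector at full frame rate (as a stand-in for content) and averages the remaining part over windows of frames (as a stand-in for slowly varying speaker traits).

```python
from statistics import fmean


def split_sub_embeddings(frames, content_dim, window):
    """Toy illustration (hypothetical, not the paper's method).

    frames: list of equal-length frame vectors (lists of floats).
    content_dim: number of leading dimensions treated as "content".
    window: number of frames averaged for the "speaker" part.
    """
    if not frames:
        return [], []
    # Content part: kept at the original frame rate.
    content = [frame[:content_dim] for frame in frames]
    # Speaker part: remaining dimensions, pooled over non-overlapping windows.
    speaker_frames = [frame[content_dim:] for frame in frames]
    speaker = []
    for start in range(0, len(speaker_frames), window):
        chunk = speaker_frames[start:start + window]
        # Average each dimension over the window to lower the time resolution.
        speaker.append([fmean(values) for values in zip(*chunk)])
    return content, speaker


# Four frames of a 4-dimensional embedding (made-up numbers).
frames = [
    [1.0, 2.0, 10.0, 20.0],
    [3.0, 4.0, 30.0, 40.0],
    [5.0, 6.0, 50.0, 60.0],
    [7.0, 8.0, 70.0, 80.0],
]
content, speaker = split_sub_embeddings(frames, content_dim=2, window=2)
```

Here `content` keeps four vectors, one per frame, while `speaker` holds two vectors, one per two-frame window. In a trained model the split would be learned rather than fixed by slicing; the sketch only shows what "varying temporal resolutions" means at the level of tensor shapes.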

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques