Speech Diarization and ASR with GMM

Aayush Kumar Sharma; Vineet Bhavikatti; Amogh Nidawani; Siddappaji; Sanath P; Dr Geetishree Mishra

arXiv:2307.05637 · eess.AS · September 1, 2024 · 1 citation

Open Access

TL;DR

This paper explores speech diarization and ASR using GMMs, focusing on separating speakers and transcribing speech with minimal word errors, by leveraging GMM parameters and pitch analysis.

Contribution

It introduces a GMM-based approach for speech diarization integrated with ASR, emphasizing the use of inter-cluster distances and pitch features to improve speaker separation and transcription accuracy.

Findings

01. GMM effectively models speech segments for diarization.

02. Inter-cluster distance threshold improves speaker segmentation.

03. Pitch analysis enhances speech recognition accuracy.

Abstract

In this research paper, we delve into the topics of Speech Diarization and Automatic Speech Recognition (ASR). Speech diarization involves the separation of individual speakers within an audio stream. By employing the ASR transcript, the diarization process aims to segregate each speaker's utterances, grouping them based on their unique audio characteristics. Automatic Speech Recognition, on the other hand, refers to the capability of a machine or program to identify and convert spoken words and phrases into a machine-readable format. In our speech diarization approach, we utilize the Gaussian Mixture Model (GMM) to represent speech segments. The inter-cluster distance is computed based on the GMM parameters, and the distance threshold serves as the stopping criterion. ASR entails the conversion of an unknown speech waveform into a corresponding written transcription. The speech signal is…
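The clustering loop described in the abstract (model each cluster, compute inter-cluster distances from the model parameters, stop merging once the closest pair exceeds a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions that do not come from the paper: each cluster is modelled by a single diagonal-covariance Gaussian (a one-component stand-in for a full GMM), the distance is the symmetric KL divergence between those Gaussians, the features are synthetic 2-D vectors rather than real acoustic features, and the names `fit_gaussian`, `symmetric_kl`, `cluster_segments`, and `make_segment` are hypothetical.

```python
import math
import random


def fit_gaussian(frames):
    """Fit a diagonal Gaussian (per-dimension mean and variance) to feature frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so the distance below never divides by zero.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def symmetric_kl(g1, g2):
    """KL(p||q) + KL(q||p) for two diagonal Gaussians (assumed distance measure)."""
    (m1, v1), (m2, v2) = g1, g2
    total = 0.0
    for mu_a, var_a, mu_b, var_b in zip(m1, v1, m2, v2):
        diff2 = (mu_a - mu_b) ** 2
        total += 0.5 * (var_a / var_b + var_b / var_a - 2.0)
        total += 0.5 * diff2 * (1.0 / var_a + 1.0 / var_b)
    return total


def cluster_segments(segments, threshold):
    """Agglomerative clustering; stops when the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(segments))]  # segment indices per cluster
    frames = [list(seg) for seg in segments]        # pooled frames per cluster
    while len(clusters) > 1:
        models = [fit_gaussian(f) for f in frames]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = symmetric_kl(models[i], models[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # stopping criterion: closest clusters are too far apart
            break
        clusters[i] += clusters[j]
        frames[i] += frames[j]
        del clusters[j], frames[j]  # j > i, so index i stays valid
    return clusters


def make_segment(rng, center, n_frames, dim=2):
    """Synthetic stand-in for one speech segment's feature frames."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n_frames)]


# Four segments from two synthetic "speakers" taking alternate turns.
rng = random.Random(0)
segments = [make_segment(rng, c, 200) for c in (0.0, 5.0, 0.0, 5.0)]
print(cluster_segments(segments, threshold=5.0))
```

The threshold value here is arbitrary and chosen only to suit the synthetic data. With real recordings it would need to be tuned, and a multi-component GMM would call for a different distance, since the KL divergence between mixtures has no closed form.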

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing