Utterance Clustering Using Stereo Audio Channels

Yingjun Dong, Neil G. MacLaren, Yiding Cao, Francis J. Yammarino, Shelley D. Dionne, Michael D. Mumford, Shane Connelly, Hiroki Sayama, and Gregory A. Ruark

arXiv:2009.05076 · eess.AS · September 22, 2021

Open Access

TL;DR

This paper enhances utterance clustering by leveraging stereo audio channels and Gaussian mixture models, demonstrating improved accuracy over mono audio methods in complex multi-person scenarios.

Contribution

The paper introduces an approach that combines stereo audio channels with Gaussian mixture models for more effective supervised utterance clustering.

Findings

01

Stereo audio processing improves clustering accuracy.

02

Gaussian mixture models effectively identify speakers.

03

Method outperforms mono audio approaches in complex environments.

Abstract

Utterance clustering is an actively researched topic in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel audio signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed signals. A Gaussian mixture model was applied for supervised utterance clustering. In the training phase, a parameter-sharing Gaussian mixture model was trained for each speaker. In the testing phase, the speaker with the maximum likelihood was selected as the detected speaker. Results of experiments with real audio recordings of multi-person discussion sessions showed that the proposed method that used multichannel audio signals…
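The two steps the abstract describes, combining the stereo channels and picking the speaker whose model gives the highest likelihood, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not say which channel combinations were used, so `combine_channels` shows sum and difference as plausible examples; the feature vectors stand in for d-vectors, which would come from a separate embedding network; and the GMM parameters in `models` are set by hand rather than trained, with a shared variance table used only as one possible reading of "parameter sharing". The function names are hypothetical and not taken from the paper.

```python
import math


def combine_channels(left, right):
    """Derive several mono signals from one stereo pair.

    The specific combinations (sum, difference) are assumptions;
    the abstract only says "a few different ways".
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return {
        "left": list(left),
        "right": list(right),
        "sum": [l + r for l, r in zip(left, right)],
        "difference": [l - r for l, r in zip(left, right)],
    }


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of one diagonal Gaussian component.
        log_pdf = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp over components for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def detect_speaker(x, speaker_models):
    """Return the speaker whose GMM assigns x the maximum likelihood."""
    return max(
        speaker_models,
        key=lambda name: gmm_log_likelihood(x, *speaker_models[name]),
    )


# Hand-set toy parameters (hypothetical, not trained): two components per
# speaker, with one variance table reused by both speakers.
shared_variances = [[0.5, 0.5], [0.5, 0.5]]
models = {
    "speaker_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], shared_variances),
    "speaker_b": ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], shared_variances),
}

mixes = combine_channels([1.0, 2.0], [0.5, -1.0])
print(mixes["sum"], mixes["difference"])     # [1.5, 1.0] [0.5, 3.0]
print(detect_speaker([0.2, 0.1], models))    # speaker_a
```

In a fuller pipeline each combined signal would be passed through the embedding extractor, and the per-speaker models would be fitted on labeled training utterances; the sketch skips both so that only the combination and maximum-likelihood decision steps are visible.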

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing