GPU-accelerated Guided Source Separation for Meeting Transcription
Desh Raj, Daniel Povey, Sanjeev Khudanpur

TL;DR
This paper presents a GPU-accelerated implementation of Guided Source Separation (GSS) that runs about 300x faster than CPU-based inference, making detailed ablation studies practical and improving meeting transcription performance on standard benchmarks.
Contribution
The paper introduces a GPU-based GSS implementation that batches processing across frequencies and segments to run inference 300x faster than CPU-based GSS, which makes extensive ablation studies and practical meeting transcription feasible.
Findings
300x speed-up over CPU-based GSS
Enables detailed parameter ablation studies
Provides reproducible pipelines for meeting benchmarks
Abstract
Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen limited adoption for meeting transcription benchmarks primarily due to its high computation time. In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide 300x speed-up over CPU-based inference. The improved inference time allows us to perform detailed ablation studies over several parameters of the GSS algorithm -- such as context duration, number of channels, and…
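The "batched processing of frequencies" mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes a mask-weighted spatial covariance computation of the kind commonly used in mask-based beamforming, and the function names, array shapes, and the use of NumPy are illustrative choices. The first function loops over frequency bins one at a time; the second computes all bins in a single batched call, which is the form that maps naturally onto a GPU array library.

```python
import numpy as np


def covariance_loop(stft, mask):
    """Mask-weighted spatial covariance, one frequency bin at a time.

    stft: complex array of shape (freq, channels, frames)  [assumed layout]
    mask: real array of shape (freq, frames) with values in [0, 1]
    """
    n_freq, n_chan, _ = stft.shape
    out = np.zeros((n_freq, n_chan, n_chan), dtype=stft.dtype)
    for f in range(n_freq):
        x = stft[f]                    # (channels, frames)
        weighted = x * mask[f]         # mask broadcasts over channels
        norm = max(mask[f].sum(), 1e-10)
        out[f] = weighted @ x.conj().T / norm
    return out


def covariance_batched(stft, mask):
    """Same quantity for every frequency bin in one batched contraction."""
    norm = np.maximum(mask.sum(axis=-1), 1e-10)          # (freq,)
    cov = np.einsum("ft,fct,fdt->fcd", mask, stft, stft.conj())
    return cov / norm[:, None, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_freq, n_chan, n_frames = 257, 4, 100   # hypothetical sizes
    stft = rng.standard_normal((n_freq, n_chan, n_frames)) \
        + 1j * rng.standard_normal((n_freq, n_chan, n_frames))
    mask = rng.uniform(size=(n_freq, n_frames))
    same = np.allclose(covariance_loop(stft, mask),
                       covariance_batched(stft, mask))
    print("loop and batched results agree:", same)
```

Both functions return the same result; the batched version removes the Python-level loop so that the whole computation is a single array operation. Batching over segments would add one more leading axis in the same way.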
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
