Jointly Tracking and Separating Speech Sources Using Multiple Features and the generalized labeled multi-Bernoulli Framework
Shoufeng Lin

TL;DR
This paper introduces a joint multi-speaker tracking and separation method using a generalized labeled multi-Bernoulli filter that leverages multiple features like location, pitch, and sound to improve accuracy and resolve spatial ambiguities.
Contribution
It presents a novel multi-feature GLMB tracking filter that jointly tracks and separates speakers by incorporating multiple features and their transition models.
Findings
Numerical evaluation shows that the method correctly tracks the locations of multiple speakers.
Separates the speech signals of the speakers while tracking them.
Resolves the spatial ambiguity that arises when speakers are close together.
Abstract
This paper proposes a novel joint multi-speaker tracking-and-separation method based on the generalized labeled multi-Bernoulli (GLMB) multi-target tracking filter, using sound mixtures recorded by microphones. Standard multi-speaker tracking algorithms usually track only speaker locations, so ambiguity occurs when speakers are spatially close. The proposed multi-feature GLMB tracking filter treats the set of vectors of associated speaker features (location, pitch and sound) as the multi-target multi-feature observation, and characterizes the transitioning features with corresponding transition models and an overall likelihood function. It thereby jointly tracks and separates each multi-feature speaker and addresses the spatial ambiguity problem. Numerical evaluation verifies that the proposed method can correctly track the locations of multiple speakers while separating their speech signals.
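To illustrate why adding a feature such as pitch can resolve spatial ambiguity, here is a minimal Python sketch. It is not the paper's GLMB filter or its likelihood function: the Gaussian error model, the independence of location and pitch errors, the standard deviations, and all names (`gaussian_pdf`, `joint_likelihood`, `doa_deg`, `pitch_hz`) are assumptions made for this example only.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def joint_likelihood(obs, track, loc_std=5.0, pitch_std=10.0):
    """Likelihood of one observation given one track.

    Assumption (not from the paper): location and pitch errors are
    independent Gaussians, so the joint likelihood is their product.
    """
    loc_term = gaussian_pdf(obs["doa_deg"], track["doa_deg"], loc_std)
    pitch_term = gaussian_pdf(obs["pitch_hz"], track["pitch_hz"], pitch_std)
    return loc_term * pitch_term


# Two hypothetical speakers that are spatially close but differ in pitch.
tracks = {
    "A": {"doa_deg": 40.0, "pitch_hz": 120.0},
    "B": {"doa_deg": 42.0, "pitch_hz": 210.0},
}

# A hypothetical observation exactly halfway between the two directions.
obs = {"doa_deg": 41.0, "pitch_hz": 205.0}

# Location alone cannot tell the speakers apart: both scores are equal.
loc_only = {
    name: gaussian_pdf(obs["doa_deg"], t["doa_deg"], 5.0)
    for name, t in tracks.items()
}

# Adding the pitch term breaks the tie.
joint = {name: joint_likelihood(obs, t) for name, t in tracks.items()}
best = max(joint, key=joint.get)
```

In this toy setup the location-only scores tie because the observation is equidistant from both speakers, while the joint score favors speaker B, whose pitch is close to the observed one. A full multi-target filter would evaluate such likelihoods over whole association hypotheses rather than one observation at a time.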
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
