TL;DR
This paper explores linear preprocessing techniques, especially independent component analysis (ICA), applied before k-means clustering of self-supervised speech model (S3M) representations, with the aim of improving the discrete speech units (DSUs) used in speech recognition.
Contribution
It introduces the use of ICA as a preprocessing step for clustering S3M representations, providing extensive analysis of its effects on DSU quality and interpretability.
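As a reference point, the four linear preprocessing methods the paper evaluates are commonly written as below. This is a textbook formulation, not necessarily the exact variant the authors use. The symbols are assumptions introduced here: $x \in \mathbb{R}^D$ is one frame-level S3M feature, $\mu$ and $\sigma$ are the per-dimension mean and standard deviation, $\Sigma = U \Lambda U^\top$ is the eigendecomposition of the feature covariance, $U_k, \Lambda_k$ keep the top $k$ components, and $c_1, \dots, c_K$ are the k-means centroids.

```latex
\begin{align*}
\text{Standardization:} \quad & z = \operatorname{diag}(\sigma)^{-1} (x - \mu) \\
\text{PCA:}             \quad & z = U_k^\top (x - \mu) \\
\text{Whitening:}       \quad & z = \Lambda_k^{-1/2} U_k^\top (x - \mu) \\
\text{ICA:}             \quad & z = R \, \Lambda_k^{-1/2} U_k^\top (x - \mu), \qquad R R^\top = I \\
\text{DSU assignment:}  \quad & u = \arg\min_{j \in \{1, \dots, K\}} \lVert z - c_j \rVert_2^2
\end{align*}
```

In this formulation ICA differs from whitening only by the rotation $R$, which is chosen to make the output components as statistically independent (non-Gaussian) as possible. Since k-means uses Euclidean distance, it is the scaling and projection steps that change the cluster geometry.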
Findings
ICA improves DSU clustering performance.
Linear preprocessing methods such as ICA improve accuracy on DSU-based ASR benchmarks.
ICA components show orthogonality and interpretability.
Abstract
Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, even with the high dimensionality and redundancy of S3M representations, preprocessing S3M representations for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their…
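The pipeline the abstract describes (linear preprocessing followed by k-means) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the synthetic `feats` array stands in for frame-level S3M features, the function names are hypothetical, symmetric FastICA with a tanh contrast is just one common ICA algorithm, and the dimensions and cluster count are arbitrary.

```python
import numpy as np

def pca_whiten(X, n_components, whiten=True):
    """Project onto the top principal components; optionally scale to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    Z = Xc @ eigvecs[:, order]
    if whiten:
        Z = Z / np.sqrt(eigvals[order] + 1e-8)
    return Z

def sym_decorrelate(W):
    """Symmetric orthogonalization: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fast_ica(Z, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA (tanh contrast) on already-whitened data Z of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        Y = Z @ W.T
        G = np.tanh(Y)
        G_prime = 1.0 - G ** 2
        # Fixed-point update per row: E[z g(w^T z)] - E[g'(w^T z)] w
        W_new = sym_decorrelate((G.T @ Z) / n - G_prime.mean(axis=0)[:, None] * W)
        # Converged when every new row is parallel to the previous one
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return Z @ W.T, W

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster index per row and the centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1), centers

# Synthetic stand-in for S3M frames: non-Gaussian sources mixed into 32 dimensions.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(2000, 8))
feats = sources @ rng.standard_normal((8, 32)) + 0.01 * rng.standard_normal((2000, 32))

Z = pca_whiten(feats, n_components=8)   # whitening
S, W = fast_ica(Z)                      # ICA = whitening followed by a rotation
units, centroids = kmeans(S, k=16)      # one discrete unit index per frame
```

Swapping `S` for `Z`, for `pca_whiten(feats, 8, whiten=False)`, or for standardized features gives the other preprocessing variants with the same clustering step, which is the kind of comparison the paper reports on ASR benchmarks.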
Taxonomy
Methods: Independent Component Analysis
