JOOCI: a Framework for Learning Comprehensive Speech Representations
Hemant Yadav, Rajiv Ratn Shah, Sunayana Sitaram

TL;DR
JOOCI is a speech representation learning framework that captures both content and paralinguistic information, outperforming WavLM and other models of similar size (100M parameters) on two speaker recognition and two language tasks from the SUPERB benchmark.
Contribution
It introduces a novel method that jointly optimizes for content and other speech information without layer-wise trade-offs, improving performance over prior SSL models.
Findings
JOOCI outperforms WavLM by 26.5% on two speaker recognition and two language tasks from the SUPERB benchmark.
It also outperforms other models of similar size (100M parameters) on these tasks.
The method effectively captures both content and paralinguistic features.
Abstract
Information in speech can be categorized into two groups: Content (what is being said, such as linguistics) and Other (how it is expressed, such as information about the speaker and paralinguistic features). Current self-supervised learning (SSL) methods are shown to divide the model's representational depth, or layers, in two, with earlier layers specializing in Other and later layers in Content-related tasks. This layer-wise division is inherently sub-optimal, as neither information type can use all layers to build hierarchical representations. To address this, we propose JOOCI, a novel speech representation learning method that does not compromise on representational depth for either information type. JOOCI outperforms WavLM by 26.5%, and other models of similar size (100M parameters), when evaluated on two speaker recognition and two language tasks from the SUPERB benchmark,…
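The layer-wise split described in the abstract is usually observed through probing setups such as the one used by SUPERB, where a downstream head learns a softmax-weighted sum over the frozen encoder's layers. The sketch below illustrates that probing idea only; it is not JOOCI's method, and the feature values, logits, and function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw per-layer scores into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_weighted_sum(layer_features, layer_logits):
    """Mix one feature vector per layer into a single vector.

    layer_features: list of num_layers vectors, each a list of floats
    layer_logits:   list of num_layers scores (learned by the probe in practice)
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    mixed = [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]
    return mixed, weights

# Toy encoder output: 4 layers, 3-dimensional features (illustrative values only).
layer_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Hypothetical logits: a speaker probe leaning on early layers,
# a content probe leaning on late layers.
speaker_logits = [3.0, 1.0, -1.0, -3.0]
content_logits = [-3.0, -1.0, 1.0, 3.0]

speaker_vec, speaker_w = layer_weighted_sum(layer_features, speaker_logits)
content_vec, content_w = layer_weighted_sum(layer_features, content_logits)

print("speaker probe layer weights:", [round(w, 3) for w in speaker_w])
print("content probe layer weights:", [round(w, 3) for w in content_w])
```

When the two probes concentrate their weights on opposite ends of the stack, as in this toy setup, neither task draws on the full depth of the encoder, which is the trade-off the abstract argues against.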
Peer Reviews
Decision: Submitted to ICLR 2025
The proposed method is sound. The experimental results on a subset of the SUPERB benchmark are strong.
- The novelty is limited. The proposed method is very close to a number of existing works, e.g.:
  - Chan et al., Content-Context Factorized Representations for Automated Speech Recognition, InterSpeech 2022.
  - Zhao et al., CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation, EMNLP 2023.
- The main claim is flawed. The paper claims SOTA on SUPERB. However, it only reports experimental results on a subset of the tasks from SUPERB (7 out of 10)
+ Addresses an important need to account for both linguistic and non-linguistic content in speech representation learning.
+ Obtains impressive results on several tasks, including speech recognition and speaker identification.
- Presentation of many details is unclear. For example, the definition of "content" and "other" is never clearly stated. Also, the model description is very brief, leaving many details to cited papers or the imagination (for example, is prosody ever/always/sometimes considered "content"?). Either the writing should be much more precise or the paper should include equations specifying all of the model components. See some other specific questions below.
- The key claimed contribution is that
* The research community is highly interested in the topic of speech representation learning.
* The proposed method's evaluation on certain SUPERB tasks yielded better results compared to the cited systems.
* The discussions and comparisons presented are technically sound.
Major issues:
* The model's effectiveness is unconvincing. The baselines cited are outdated and not state-of-the-art, and the model's performance on the semantic tasks is not better.
* The paper's discussion of different model architectures is shallow, limiting its contribution and making it difficult to draw general conclusions.
Minor:
* Figure 1 could be simplified by removing the hyperparameters.
* The discussion of "Data augmentation" in Line 52 seems out of place, as the initial focus
Taxonomy
Topics: Speech Recognition and Synthesis
