Online Register for Dual-Mode Self-Supervised Speech Models: Mitigating The Lack of Future Context
Keita Goto, Takashi Maekaku, Jin Sakuma, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe

TL;DR
This paper introduces online registers with a future prediction loss to improve streaming self-supervised speech models, effectively reducing the performance gap caused by missing future context in online processing.
Contribution
The paper proposes learnable online registers and a future prediction loss to enhance online self-supervised speech models, addressing attention mismatch issues without increasing latency.
Findings
Online registers yield a 3.4% relative improvement for the online model on LibriSpeech with 160 ms chunks.
The approach reduces the performance gap between offline and online modes.
Effective in low-latency streaming scenarios.
Abstract
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in the offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks, especially in…
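To make the idea of registers "appended to each chunk" concrete, the following is a minimal sketch of how a chunk-wise attention mask with register slots might be laid out. It is not the authors' implementation: the function names `build_layout` and `build_mask` are hypothetical, and the masking rule (each position sees frames of the current and earlier chunks, plus only the registers of its own chunk) is an assumption, since the abstract does not specify it.

```python
from typing import List, Tuple

# Each position in the augmented sequence is described by
# (chunk_index, is_register). All names here are illustrative only.
Position = Tuple[int, bool]


def build_layout(num_frames: int, chunk_size: int, num_registers: int) -> List[Position]:
    """Lay out frames chunk by chunk, appending register slots after each chunk."""
    layout: List[Position] = []
    num_chunks = (num_frames + chunk_size - 1) // chunk_size  # ceiling division
    for k in range(num_chunks):
        frames_in_chunk = min(chunk_size, num_frames - k * chunk_size)
        layout.extend((k, False) for _ in range(frames_in_chunk))  # real frames
        layout.extend((k, True) for _ in range(num_registers))     # register slots
    return layout


def build_mask(layout: List[Position]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key position k.

    Assumed rule (not confirmed by the source):
      * frame keys are visible if they belong to the current or an earlier chunk;
      * register keys are visible only inside their own chunk.
    """
    mask: List[List[bool]] = []
    for q_chunk, _ in layout:
        row = []
        for k_chunk, k_is_register in layout:
            if k_is_register:
                row.append(k_chunk == q_chunk)
            else:
                row.append(k_chunk <= q_chunk)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 5 frames, chunks of 2 frames, 1 register per chunk (toy sizes).
    demo_layout = build_layout(num_frames=5, chunk_size=2, num_registers=1)
    for row in build_mask(demo_layout):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because the register slots carry no input audio, appending them does not require waiting for future frames, which is consistent with the abstract's claim that no additional latency is introduced. How the registers are trained, including the future prediction loss, is not covered by this sketch.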
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning
