Effective Context in Neural Speech Models
Yen Meng, Sharon Goldwater, Hao Tang

TL;DR
This paper introduces methods to measure the actual context used by neural speech models, revealing how different models and tasks utilize context and enabling streaming without architecture changes.
Contribution
It proposes two approaches to quantify effective context in speech Transformers and demonstrates their application across supervised and self-supervised models.
Findings
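Among supervised models, effective context tracks the nature of the task: fundamental frequency tracking, phone classification, and word classification require increasing amounts of context.
In self-supervised models, effective context grows mainly in the early layers and stays relatively short, similar to the supervised phone model.
HuBERT can be run in streaming mode without architecture changes.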
Abstract
Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short -- similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in…
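The gap between maximum and effective context can be illustrated with a small sketch. The abstract does not describe the paper's two measurement approaches, so the code below is not the authors' method. It is a generic perturbation-based stand-in, and `toy_model`, `sensitivity`, `effective_context`, and the 95% mass threshold are all hypothetical names and choices made for illustration. The toy model is a stack of moving averages whose maximum context is known by construction.

```python
# Illustrative only: a generic perturbation-based measure, NOT the paper's method.

def toy_model(x, radius=2, layers=3):
    """Hypothetical stand-in for a speech model: stacked moving averages.

    Each layer sees `radius` frames on either side, so the maximum
    context after `layers` layers is 2 * radius * layers + 1 frames.
    Edges are treated as zero-padded.
    """
    for _ in range(layers):
        n = len(x)
        x = [
            sum(x[max(0, i - radius):min(n, i + radius + 1)]) / (2 * radius + 1)
            for i in range(n)
        ]
    return x


def sensitivity(model, x, t, eps=1e-3):
    """Finite-difference sensitivity of output frame t to each input frame."""
    base = model(x)[t]
    sens = []
    for j in range(len(x)):
        bumped = list(x)
        bumped[j] += eps
        sens.append(abs(model(bumped)[t] - base) / eps)
    return sens


def effective_context(sens, t, mass=0.95):
    """Width of the smallest symmetric window around t that holds
    a fraction `mass` of the total sensitivity (assumed threshold)."""
    total = sum(sens)
    if total == 0:
        return 0
    for half in range(len(sens)):
        lo, hi = max(0, t - half), min(len(sens), t + half + 1)
        if sum(sens[lo:hi]) >= mass * total:
            return hi - lo
    return len(sens)


frames = [0.0] * 41            # dummy input; the toy model is linear
t = 20                         # analyze the middle output frame
sens = sensitivity(toy_model, frames, t)
ctx = effective_context(sens, t)
max_ctx = 2 * 2 * 3 + 1        # 13 frames by construction
print(f"effective context: {ctx} frames (maximum: {max_ctx})")
```

In this toy, the frames at the edge of the 13-frame receptive field contribute very little to the output, so the measured effective context comes out shorter than the maximum. This is the same kind of gap the abstract describes for real speech Transformers.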
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Topic Modeling
