ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs

Xingchen Song; Di Wu; Binbin Zhang; Zhendong Peng; Bo Dang; Fuping Pan; Zhiyong Wu

arXiv:2305.10649 · cs.SD · October 10, 2023 · 1 citation

TL;DR

ZeroPrompt is a training-free method that reduces the token display time of streaming ASR models by appending zeroed content to each chunk during inference, with no loss of accuracy.

Contribution

The paper argues that streaming acoustic encoders naturally have the modeling ability of masked language models, and it proposes ZeroPrompt as a simple, training-free latency reduction technique that can be applied across datasets.

Findings

1. 350-700 ms reduction in First Token Display Time (TDT-F)
2. 100-400 ms reduction in last token display time
3. No accuracy loss on the Aishell-1 and LibriSpeech datasets

Abstract

In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective training-free methods to decrease the Token Display Time (TDT) of streaming ASR models without any accuracy loss. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts like a prompt to encourage the model to predict future tokens even before they are spoken. We argue that streaming acoustic encoders naturally have the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350 to 700 ms reduction in First Token Display Time (TDT-F) and a 100 to 400 ms reduction in Last Token…
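The chunk-padding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `append_zero_prompt` and `prompted_chunks`, the chunk size, and the prompt length are all hypothetical, and the encoder itself and the Prompt-and-Refine strategy are left out. It only shows how all-zero frames might be appended to each chunk of acoustic features before the chunk is passed to a streaming encoder.

```python
from typing import Iterator, List

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def append_zero_prompt(chunk: List[Frame], prompt_frames: int) -> List[Frame]:
    """Return a copy of `chunk` followed by `prompt_frames` all-zero frames."""
    if not chunk:
        return []
    dim = len(chunk[0])
    zeros = [[0.0] * dim for _ in range(prompt_frames)]
    return [list(frame) for frame in chunk] + zeros


def prompted_chunks(
    features: List[Frame], chunk_size: int, prompt_frames: int
) -> Iterator[List[Frame]]:
    """Split `features` into chunks and append a zeroed prompt to each one."""
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        yield append_zero_prompt(chunk, prompt_frames)


# Toy input: 5 frames of 2-dimensional features (values are arbitrary).
features = [[float(t + 1), float(t + 1)] for t in range(5)]

# Assumed settings for illustration only: 2 real frames plus 3 zeroed frames.
chunks = list(prompted_chunks(features, chunk_size=2, prompt_frames=3))

for i, chunk in enumerate(chunks):
    print(i, len(chunk), chunk)
```

In a real system, each padded chunk would be fed to the streaming encoder in place of the unpadded one; how the outputs for the zeroed frames are then used or revised is described in the paper and is not modeled here.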

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing