Accelerating Transducers through Adjacent Token Merging

Yuang Li; Yu Wu; Jinyu Li; Shujie Liu

arXiv:2306.16009·cs.CL·June 29, 2023

TL;DR

This paper introduces Adjacent Token Merging (A-ToMe), a method that reduces the number of tokens in Transformer-based ASR encoders, speeding up inference without notable accuracy loss, especially for long speech signals.

Contribution

The paper proposes A-ToMe, a novel token merging technique that reduces computational complexity and speeds up inference in speech recognition models.

Findings

01. Reduces the token count by 57% in LibriSpeech experiments.

02. Speeds up inference on GPU by 70%.

03. Remains effective in long-form ASR, where the input consists of multiple utterances.

Abstract

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embeddings at a high frame rate. However, this design is inefficient, particularly for long speech signals, due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. In this way, the total number of time steps can be reduced, and the inference of both the encoder and the joint network is accelerated. Experiments on LibriSpeech show that our method can reduce the number of tokens by 57% and improve the inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution for reducing tokens in long-form ASR, where the input speech consists of multiple utterances.
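The merging idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `merge_adjacent_tokens`, the fixed `threshold` value, the greedy left-to-right pairing, and the use of a plain average are all assumptions made for illustration. It shows a single merging step in which neighbouring tokens are averaged when the cosine similarity of their keys is high.

```python
import numpy as np


def merge_adjacent_tokens(tokens, keys, threshold=0.9):
    """Average neighbouring tokens whose keys are highly similar.

    tokens: (T, D) array of token embeddings, T >= 1.
    keys:   (T, Dk) array of the corresponding attention keys.
    threshold: hypothetical cosine-similarity cutoff (an assumption).
    Returns an array of shape (T', D) with T' <= T.
    """
    tokens = np.asarray(tokens, dtype=float)
    keys = np.asarray(keys, dtype=float)

    # Cosine similarity between each key and its right-hand neighbour.
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)
    sim = np.sum(unit[:-1] * unit[1:], axis=1)  # shape (T - 1,)

    merged = []
    t = 0
    num_tokens = len(tokens)
    while t < num_tokens:
        # Greedy pairing: each token takes part in at most one merge.
        if t + 1 < num_tokens and sim[t] > threshold:
            merged.append((tokens[t] + tokens[t + 1]) / 2.0)
            t += 2
        else:
            merged.append(tokens[t])
            t += 1
    return np.stack(merged)


# Toy input: tokens 0/1 and tokens 2/3 have nearly identical keys.
example_keys = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]])
example_tokens = np.array([[1.0], [3.0], [10.0], [20.0]])
example_merged = merge_adjacent_tokens(example_tokens, example_keys)
```

In this toy input, the four tokens collapse into two. Because self-attention cost grows quadratically with sequence length, later layers operating on the shorter sequence do proportionally less work, which is the source of the speed-up the abstract reports.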

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings