Efficient Long-Form Speech Recognition for General Speech In-Context Learning

Hao Yen; Shaoshi Ling; Guoli Ye

arXiv:2409.19757 · eess.AS · October 1, 2024

Open Access

TL;DR

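This paper introduces SICL-AED, an attention-based encoder-decoder speech recognition model with speech in-context learning (SICL) that handles long-form decoding, test-time speaker adaptation, and test-time contextual biasing. On TEDLIUM3 it achieves an 8.64% relative WER reduction over a baseline utterance-level AED model, and it performs comparably to conventional long-form AED systems with less runtime and memory.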

Contribution

The paper presents a novel attention-based end-to-end speech recognition model with in-context learning capabilities, enabling efficient long-form decoding and test-time adaptation without extensive fine-tuning.

Findings

01. 8.64% relative WER reduction on TEDLIUM3 over a baseline utterance-level AED model

02. Performance comparable to conventional long-form AED systems, with less runtime and memory

03. 64% increase in entity recall for contextual biasing
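
For readers unfamiliar with how a relative WER reduction is computed, the following is a minimal sketch of the arithmetic. The baseline WER of 10.0% is a hypothetical number chosen only for illustration; it is not taken from the paper, and the function names are likewise illustrative.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def wer_after_reduction(baseline_wer: float, rel_reduction_pct: float) -> float:
    """WER implied by applying a relative reduction (in percent) to a baseline."""
    return baseline_wer * (1.0 - rel_reduction_pct / 100.0)


# HYPOTHETICAL baseline, for illustration only (not a figure from the paper).
hypothetical_baseline_wer = 10.0

# The 8.64% figure is the relative reduction reported in the Findings.
implied_wer = wer_after_reduction(hypothetical_baseline_wer, 8.64)
print(f"implied WER: {implied_wer:.3f}%")
print(f"round trip:  {relative_reduction(hypothetical_baseline_wer, implied_wer):.2f}% relative")
```

Note that a relative reduction scales with the baseline: the same 8.64% corresponds to a smaller absolute change when the baseline WER is lower.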

Abstract

We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. Specifically, we introduce an attention-based encoder-decoder (AED) model with SICL capability (referred to as SICL-AED), where the decoder utilizes an utterance-level cross-attention to integrate information from the encoder's output efficiently, and a document-level self-attention to learn contextual information. Evaluated on the benchmark TEDLIUM3 dataset, SICL-AED achieves an 8.64% relative word error rate (WER) reduction compared to a baseline utterance-level AED model by leveraging previously decoded outputs as in-context examples. It also demonstrates comparable performance to conventional long-form AED systems with significantly reduced…
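
The two attention scopes named in the abstract can be pictured as boolean masks over a document made of several concatenated utterances. The sketch below is one plausible reading of the abstract, not the authors' implementation: it assumes each decoder token cross-attends only to encoder frames of its own utterance, while self-attention is causal over all tokens in the document. The function names and the token and frame counts are hypothetical.

```python
import numpy as np


def segment_ids(lengths):
    """Map each position of a concatenated sequence to its utterance index."""
    return np.repeat(np.arange(len(lengths)), lengths)


def utterance_cross_attention_mask(token_lens, frame_lens):
    """Block-diagonal mask: True where a token and a frame share an utterance.

    Assumption: "utterance-level" means tokens never attend to encoder
    frames of other utterances.
    """
    tok = segment_ids(token_lens)
    frm = segment_ids(frame_lens)
    return tok[:, None] == frm[None, :]


def document_self_attention_mask(token_lens):
    """Causal mask over the whole document: each token sees itself and all
    earlier tokens, including those of previously decoded utterances."""
    n = int(np.sum(token_lens))
    return np.tril(np.ones((n, n), dtype=bool))


# Hypothetical document: 2 utterances with 2 and 3 tokens, 4 and 5 frames.
token_lens, frame_lens = [2, 3], [4, 5]
cross = utterance_cross_attention_mask(token_lens, frame_lens)
self_mask = document_self_attention_mask(token_lens)

# Allowed cross-attention pairs vs. a full (document-level) cross-attention.
print(int(cross.sum()), "of", cross.size, "token-frame pairs are attended")
```

Under this reading, the cross-attention cost grows with the sum of per-utterance token-frame products rather than with the product of total tokens and total frames, which is consistent with the efficiency claim, while the document-wide self-attention lets earlier transcripts act as in-context examples.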

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing