CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting

Sichen Jin; Youngmoon Jung; Seungjin Lee; Jaeyoung Roh; Changwoo Han; Hoonyoung Cho

arXiv:2406.07923·cs.SD·September 27, 2024

TL;DR

This paper presents a streaming open-vocabulary keyword spotting method that dynamically aligns audio and text embeddings using CTC, achieving competitive accuracy with low latency and model complexity.

Contribution

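It introduces what the authors describe as, to the best of their knowledge, the first approach that dynamically aligns audio and keyword text on-the-fly with CTC to obtain a joint audio-text embedding for streaming keyword spotting, enabling real-time open-vocabulary detection.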

Findings

01

Achieves competitive performance on the LibriPhrase dataset compared with non-streaming methods.

02

Uses only 155K model parameters.

03

Operates with O(U) decoding complexity.
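
One way to picture a decoding cost that grows linearly with the keyword length is the textbook CTC Viterbi recursion, which touches 2U+1 states per input frame. The sketch below is a generic illustration, not the paper's decoder. It assumes U is the number of keyword tokens, that the cost is counted per frame, and that blank has index 0; the function names `extend_with_blanks` and `viterbi_step` are hypothetical.

```python
import numpy as np

NEG_INF = -np.inf

def extend_with_blanks(tokens, blank=0):
    """Interleave blanks: [y1, y2] -> [blank, y1, blank, y2, blank]."""
    ext = [blank]
    for t in tokens:
        ext += [t, blank]
    return np.array(ext)

def viterbi_step(prev, log_probs, ext, blank=0):
    """Advance the best-path scores by one frame.

    prev:      scores over the 2U+1 extended states, or None on the first frame
    log_probs: per-frame log posteriors over the vocabulary
    Assumes at least one keyword token (U >= 1).
    """
    emit = log_probs[ext]
    if prev is None:
        cur = np.full(len(ext), NEG_INF)
        cur[0], cur[1] = emit[0], emit[1]  # start in the first blank or first token
        return cur
    stay = prev
    step = np.concatenate(([NEG_INF], prev[:-1]))
    skip = np.concatenate(([NEG_INF, NEG_INF], prev[:-2]))
    # Skipping a blank is allowed only between two different non-blank tokens.
    can_skip = np.zeros(len(ext), dtype=bool)
    can_skip[2:] = (ext[2:] != blank) & (ext[2:] != ext[:-2])
    skip = np.where(can_skip, skip, NEG_INF)
    return emit + np.maximum(stay, np.maximum(step, skip))

# Toy usage: vocabulary {0: blank, 1, 2}, keyword tokens [1, 2], two frames.
ext = extend_with_blanks([1, 2])
frames = np.log(np.array([[0.1, 0.8, 0.1],
                          [0.1, 0.1, 0.8]]))
state = None
for frame in frames:
    state = viterbi_step(state, frame, ext)  # constant work per state, 2U+1 states
score = max(state[-1], state[-2])  # best path ending in the last token or final blank
```

Each call does a fixed amount of work per extended state, so the per-frame cost scales with U rather than with the length of the audio seen so far.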

Abstract

This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment. For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC) and aggregates the frame-level acoustic embedding (AE) to obtain higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text. After that, we calculate the similarity of the aggregated AE and the TE. To the best of our knowledge, this is the first attempt to dynamically align the audio and the keyword text on-the-fly to attain the joint audio-text embedding for KWS. Despite operating in a streaming fashion, our approach achieves competitive performance on the LibriPhrase dataset compared to the non-streaming methods with a mere 155K model parameters and a decoding…
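
The aggregation and similarity steps described in the abstract can be sketched as follows, given an alignment that is already available. This is a minimal illustration under stated assumptions, not the authors' implementation: mean pooling as the aggregation, cosine similarity as the score, averaging of per-token scores, and the `BLANK = -1` marker are all choices made here for the example, and the function names are hypothetical.

```python
import numpy as np

BLANK = -1  # hypothetical marker for frames aligned to the CTC blank

def aggregate_by_alignment(frame_ae, alignment, num_tokens):
    """Mean-pool frame-level embeddings over the frames aligned to each token."""
    frame_ae = np.asarray(frame_ae, dtype=float)
    alignment = np.asarray(alignment)
    pooled = np.zeros((num_tokens, frame_ae.shape[1]))
    for u in range(num_tokens):
        mask = alignment == u
        if mask.any():  # tokens with no aligned frames keep a zero vector
            pooled[u] = frame_ae[mask].mean(axis=0)
    return pooled

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy usage: six frames with 2-dimensional embeddings and a two-token keyword.
frame_ae = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0],
                     [0.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
alignment = np.array([BLANK, 0, 0, BLANK, 1, 1])  # frame -> token index
token_ae = aggregate_by_alignment(frame_ae, alignment, num_tokens=2)

text_te = np.array([[1.0, 0.0], [0.0, 1.0]])  # made-up text embeddings
scores = [cosine_similarity(token_ae[u], text_te[u]) for u in range(2)]
keyword_score = float(np.mean(scores))
```

In a streaming setting this would be recomputed, or updated incrementally, for the best alignment ending at each new frame, with a detection threshold applied to the resulting score.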

Taxonomy

Methods: ALIGN · Autoencoders