CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR

Wei Zhou; Junteng Jia; Leda Sari; Jay Mahadeokar; Ozlem Kalinli

arXiv:2411.07607·eess.AS·January 3, 2025

TL;DR

This paper introduces CJST, a joint speech and text training framework for decoder-only ASR that uses a CTC compressor, enabling effective text injection and improving performance in both in-domain and cross-domain scenarios.

Contribution

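The paper proposes CJST, a framework that combines a simple modality adaptor with several features of the CTC compressor (sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings) for joint speech and text training in decoder-only ASR models.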

Findings

01 Achieves the best performance in both in-domain and cross-domain scenarios on Librispeech and TED-LIUM2.

02 Injects text effectively without the need for duration handling.

03 Provides a comprehensive analysis of CTC compressor behavior.

Abstract

A CTC compressor can be an effective way to integrate audio encoders into decoder-only models, which have gained growing interest for different speech applications. In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches the speech and text modalities from both directions by exploring a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves effective text injection without the need for duration handling, leading to the best performance in both in-domain and cross-domain scenarios. We also provide a comprehensive study of the CTC compressor, covering various compression modes, edge case handling and behavior…
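To make the "sequence compression" idea concrete, the following is a minimal sketch of one common form of CTC-based compression: consecutive encoder frames that share the same frame-level CTC argmax label are averaged into a single vector, and blank segments are optionally dropped. This is a generic illustration, not the paper's implementation. The function name `ctc_compress`, the `remove_blank` flag, and the choice of blank index 0 are assumptions made for this example; the abstract does not specify which compression modes CJST uses.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_compress(frames, label_ids, remove_blank=True):
    """Average consecutive frames that share the same CTC argmax label.

    frames:    list of equal-length feature vectors, one per encoder frame
    label_ids: per-frame argmax CTC label, same length as frames
    Returns (compressed_frames, compressed_labels).
    """
    if len(frames) != len(label_ids):
        raise ValueError("frames and label_ids must have the same length")

    out_frames, out_labels = [], []
    start = 0
    for label, run in groupby(label_ids):
        n = len(list(run))                  # length of this run of identical labels
        segment = frames[start:start + n]   # encoder frames belonging to the run
        start += n
        if remove_blank and label == BLANK:
            continue                        # drop blank segments entirely
        dim = len(segment[0])
        mean = [sum(vec[d] for vec in segment) / n for d in range(dim)]
        out_frames.append(mean)
        out_labels.append(label)
    return out_frames, out_labels


# Five encoder frames with argmax labels: blank, 7, 7, blank, 3
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
labels = [0, 7, 7, 0, 3]
compressed, kept = ctc_compress(frames, labels)
print(compressed, kept)
```

Note that if every frame is labeled blank, this sketch returns an empty sequence. That is one example of the kind of edge case a compressor has to handle; how the paper treats it is not stated in the abstract.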

Taxonomy

Topics: Speech Recognition and Synthesis