WST: Weakly Supervised Transducer for Automatic Speech Recognition

Dongji Gao; Chenda Liao; Changliang Liu; Matthew Wiesner; Leibny Paola Garcia; Daniel Povey; Sanjeev Khudanpur; Jian Wu

arXiv:2511.04035 · cs.CL · November 7, 2025

Open Access

TL;DR

This paper introduces WST, a weakly supervised transducer for speech recognition that stays robust at high transcription error rates and reduces reliance on costly annotated data, outperforming existing CTC-based weakly supervised approaches such as BTC and OTC.

Contribution

The paper presents WST, a transducer model whose flexible training graph handles errors in the transcripts without additional confidence estimation or auxiliary pre-trained models, improving weakly supervised ASR performance.

Findings

01. WST maintains performance at transcription error rates of up to 70%.

02. WST outperforms the CTC-based approaches BTC and OTC in robustness and accuracy.

03. The implementation will be made publicly available.

Abstract

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing