TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

Shashi Kumar; Srikanth Madikeri; Juan Zuluaga-Gomez; Iuliia Thorbecke; Esaú Villatoro-Tello; Sergio Burdisso; Petr Motlicek; Karthik Pandia; Aravind Ganapathiraju

arXiv:2407.04444·cs.CL·October 10, 2024

TL;DR

TokenVerse is a single Transducer-based model that performs ASR, speaker change detection, endpointing, and NER by emitting task-specific tokens alongside the transcript. It improves ASR by up to 7.7% relative WER and outperforms a cascaded pipeline on the individual tasks.

Contribution

This paper presents a single Transducer-based model that performs ASR, speaker change detection, endpointing, and NER jointly, replacing the separate NLP models of a cascaded conversational AI pipeline.

Findings

01 Up to 7.7% relative WER improvement.

02 Outperforms cascaded pipeline in individual tasks.

03 Effective on both public and private datasets.
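
For readers unfamiliar with the metric: relative WER improvement is conventionally the reduction in WER divided by the baseline WER. The worked numbers below are illustrative assumptions chosen to produce a 7.7% relative change; they are not results reported in the paper.

```latex
\[
\Delta_{\text{rel}}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{baseline}}},
\qquad
\text{e.g.}\quad \frac{10.00 - 9.23}{10.00} = 0.077 = 7.7\%
\]
```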

Abstract

In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is…
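
The idea of integrating task-specific tokens into the reference text can be illustrated with a minimal sketch. The token strings (`[SCD]`, `[ENDP]`, `[NE]`, `[/NE]`), the `Word` annotation format, and both helper functions are assumptions made for illustration, not the authors' released code. The sketch only shows how an annotated transcript might be turned into a training target, and how the tokens could be stripped again before scoring WER.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical token inventory, chosen for illustration only.
SCD = "[SCD]"        # speaker change detected before the next word
ENDP = "[ENDP]"      # semantic endpoint after the previous word
NE_OPEN = "[NE]"     # start of a named entity span
NE_CLOSE = "[/NE]"   # end of a named entity span
TASK_TOKENS = {SCD, ENDP, NE_OPEN, NE_CLOSE}


@dataclass
class Word:
    """One annotated word of a conversation (assumed annotation format)."""
    text: str
    speaker: str
    in_entity: bool = False
    ends_utterance: bool = False


def augment_reference(words: List[Word]) -> str:
    """Interleave task tokens with the words to build an ASR training target.

    Simplification: adjacent entities are merged into one span, because the
    annotation here is a single boolean per word.
    """
    out: List[str] = []
    prev_speaker = None
    entity_open = False
    for w in words:
        if entity_open and not w.in_entity:
            out.append(NE_CLOSE)
            entity_open = False
        if prev_speaker is not None and w.speaker != prev_speaker:
            out.append(SCD)
        if w.in_entity and not entity_open:
            out.append(NE_OPEN)
            entity_open = True
        out.append(w.text)
        if w.ends_utterance:
            if entity_open:
                out.append(NE_CLOSE)
                entity_open = False
            out.append(ENDP)
        prev_speaker = w.speaker
    if entity_open:
        out.append(NE_CLOSE)
    return " ".join(out)


def strip_task_tokens(text: str) -> str:
    """Remove task tokens so the plain transcript can be scored for WER."""
    return " ".join(t for t in text.split() if t not in TASK_TOKENS)


words = [
    Word("hello", "A"),
    Word("john", "A", in_entity=True),
    Word("smith", "A", in_entity=True, ends_utterance=True),
    Word("hi", "B"),
    Word("there", "B", ends_utterance=True),
]
target = augment_reference(words)
print(target)
print(strip_task_tokens(target))
```

Under these assumptions, a single decoding pass yields the transcript and the task labels together: the positions of the special tokens in the hypothesis carry the speaker change, endpoint, and entity predictions, so no separate NLP model runs on the ASR output.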

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis