BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

Zhehuai Chen; He Huang; Oleksii Hrinchuk; Krishna C. Puvvada; Nithin Rao Koluguri; Piotr Żelasko; Jagadeesh Balam; Boris Ginsburg

arXiv:2406.19954·cs.CL·July 1, 2024

Open Access

TL;DR

BESTOW is a novel speech language model that combines GPT and T5 architectures to enable efficient, streamable, and multitask speech understanding, surpassing previous models in performance and flexibility.

Contribution

The paper introduces BESTOW, a unified architecture that integrates GPT and T5 features for streamable, multitask speech understanding, and provides the first open-source solution for scalable streaming SpeechLLM.

Findings

01 Achieves strong performance across multiple speech tasks.

02 Supports streaming and multitask capabilities simultaneously.

03 Reduces training and inference costs.

Abstract

Incorporating speech understanding capabilities into pretrained large language models (SpeechLLM) has become a vital research direction. Previous architectures fall into two categories: i) GPT-style, which prepends speech prompts to the text prompts as a single sequence of LLM inputs, as in a decoder-only model; ii) T5-style, which introduces speech cross-attention into each layer of the pretrained LLM. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially one that generalizes to speech multitask settings. We reformulate streamable SpeechLLM as a read-write policy problem and unify offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables…
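To make the "read-write policy" framing concrete, here is a minimal sketch of one well-known fixed policy, wait-k, in which the model first reads k speech chunks and then alternates between writing one text token and reading one more chunk. The truncated abstract above does not say which policy BESTOW uses, so wait-k, the function name `wait_k_schedule`, and its parameters are assumptions chosen purely for illustration. The sketch also assumes the number of output tokens is known in advance, which a real streaming decoder would replace with an end-of-sequence check.

```python
from typing import Iterator, Tuple


def wait_k_schedule(
    num_speech_chunks: int, num_text_tokens: int, k: int
) -> Iterator[Tuple[str, int]]:
    """Yield ("read", chunk_idx) and ("write", token_idx) actions.

    Hypothetical illustration of a fixed wait-k read-write policy;
    not taken from the BESTOW paper or its released code.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    read = 0      # speech chunks consumed so far
    written = 0   # text tokens emitted so far
    while written < num_text_tokens:
        # Stay k chunks ahead of the output until the speech input runs out.
        if read < min(written + k, num_speech_chunks):
            yield ("read", read)
            read += 1
        else:
            yield ("write", written)
            written += 1


actions = list(wait_k_schedule(num_speech_chunks=4, num_text_tokens=3, k=2))
short_input = list(wait_k_schedule(num_speech_chunks=1, num_text_tokens=2, k=3))
```

With k=2, the schedule reads two chunks, then interleaves writes and reads. Once the speech input is exhausted, the remaining tokens are written without further reads, which is the offline case in the limit of large k.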

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques