BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

TL;DR
BESTOW is a novel speech language model that combines GPT and T5 architectures to enable efficient, streamable, and multitask speech understanding, surpassing previous models in performance and flexibility.
Contribution
The paper introduces BESTOW, a unified architecture that integrates GPT and T5 features for streamable, multitask speech understanding, and provides the first open-source solution for scalable streaming SpeechLLM.
Findings
Achieves strong performance across multiple speech tasks.
Supports streaming and multitask capabilities simultaneously.
Reduces training and inference costs.
Abstract
Incorporating speech understanding capabilities into pretrained large language models has become a vital research direction (SpeechLLM). Previous architectures can be categorized as: i) GPT-style, which prepend speech prompts to the text prompts as a single sequence of LLM inputs, as in a decoder-only model; ii) T5-style, which introduce speech cross-attention into each layer of the pretrained LLM. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering that the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unify offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables…
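The two styles named in the abstract differ mainly in where the speech features enter the model. The following is a minimal NumPy sketch of that contrast, not the paper's implementation: the function names, the toy dimensions (50 speech frames, 6 text tokens, width 8), and the single unmasked attention step are all assumptions made for illustration. A real decoder would also apply a causal mask and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Plain scaled dot-product attention (no masking, no learned projections).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def gpt_style_inputs(speech_feats, text_embeds):
    # GPT-style: speech features are prepended to the text embeddings,
    # so the LLM processes one long sequence of length S + T.
    return np.concatenate([speech_feats, text_embeds], axis=0)

def t5_style_cross_attention(speech_feats, text_embeds):
    # T5-style: text positions act as queries over the speech features,
    # so the LLM sequence length stays at T.
    return attention(text_embeds, speech_feats, speech_feats)

rng = np.random.default_rng(0)
speech = rng.normal(size=(50, 8))  # hypothetical: 50 speech frames
text = rng.normal(size=(6, 8))     # hypothetical: 6 text tokens

gpt_seq = gpt_style_inputs(speech, text)
gpt_out = attention(gpt_seq, gpt_seq, gpt_seq)  # self-attention over S + T positions
t5_out = t5_style_cross_attention(speech, text)

print(gpt_seq.shape, gpt_out.shape, t5_out.shape)
```

In this toy setup the GPT-style path attends over 56 positions while the cross-attention path keeps the text sequence at 6 positions, which is one way to read the efficiency argument in the abstract.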
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
