Instruction Data Generation and Unsupervised Adaptation for Speech Language Models
Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces three methods for generating synthetic speech and text data to improve multimodal large language models, addressing data scarcity and enabling better cross-modal understanding.
Contribution
The paper presents novel synthetic data generation techniques using large language models and text-to-speech systems for training multimodal speech-language models.
Findings
Improved integrated understanding of text and speech.
Synthetic samples generated from unlabeled speech are comparable in quality to those built from real transcriptions.
Potential to expand models to more languages using unlabeled speech.
Abstract
In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with…
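The division of labor described in the abstract (a large language model produces the textual components, a text-to-speech system produces the speech component) can be illustrated with a minimal sketch. This is not the authors' implementation: `SyntheticSample`, `build_sample`, `stub_llm`, and `stub_tts` are hypothetical names, and the two stubs stand in for a real LLM and a real TTS system so the example runs without external dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SyntheticSample:
    """One training example pairing a speech input with a text instruction and response."""
    audio: List[float]   # waveform samples for the speech component
    instruction: str     # text prompt about the spoken content
    response: str        # target text output


def build_sample(
    passage: str,
    llm: Callable[[str], Tuple[str, str]],
    tts: Callable[[str], List[float]],
) -> SyntheticSample:
    """Assemble a synthetic sample from a text passage.

    Assumption: the LLM returns an (instruction, response) pair about the
    passage, and the TTS system turns the passage into a waveform.
    """
    instruction, response = llm(passage)
    audio = tts(passage)
    return SyntheticSample(audio=audio, instruction=instruction, response=response)


def stub_llm(passage: str) -> Tuple[str, str]:
    # Hypothetical stand-in for an LLM call: answers with the first sentence.
    first_sentence = passage.split(".")[0].strip() + "."
    return "Summarize the spoken passage.", first_sentence


def stub_tts(text: str) -> List[float]:
    # Hypothetical stand-in for a TTS call: a silent waveform whose
    # length scales with the text length.
    return [0.0] * (len(text) * 10)


sample = build_sample("The cat sat on the mat. It was warm.", stub_llm, stub_tts)
print(sample.instruction, "|", sample.response, "|", len(sample.audio))
```

Passing the LLM and TTS components in as callables keeps the assembly step independent of any particular model, so either stub can be replaced by a real system without changing `build_sample`.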
Taxonomy
Topics: Speech Recognition and Synthesis
