LLaSM: Large Language and Speech Model
Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen, Shi, Qiqi Xiang, Yemin Shi

TL;DR
LLaSM introduces a multi-modal speech and language model capable of following complex instructions, emphasizing speech as a vital modality for human-AI interaction, supported by a new dataset and initial experiments.
Contribution
It presents the first end-to-end trained large speech-language model with cross-modal conversational abilities and releases a new speech instruction dataset.
Findings
LLaSM demonstrates effective speech-language instruction following.
The model offers a more natural interaction method.
Early experiments show promising performance.
Abstract
Multi-modal large language models have garnered significant interest recently. Though, most of the works focus on vision-language multi-modal models providing strong capabilities in following vision-and-language instructions. However, we claim that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM demonstrates a more convenient and natural way for humans to interact with artificial intelligence. Specifically, we also release a large Speech Instruction Following dataset…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and dialogue systems
MethodsFocus
