BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, Jiajun Zhang

TL;DR
This paper introduces BLSP, a novel approach that aligns speech and text behaviors in language models by training a modality adapter, enabling speech-related tasks without extensive speech instruction data.
Contribution
BLSP proposes a lightweight modality adapter trained via behavior alignment, bridging speech and text in LLMs without relying on large speech instruction datasets.
Findings
Enables speech recognition, translation, and understanding with LLMs.
Supports zero-shot cross-lingual speech tasks.
Does not require large-scale speech instruction data.
Abstract
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM,…
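As a rough illustration of where such an adapter sits, the sketch below maps frame-level outputs of a frozen speech encoder into vectors of the size an LLM expects as input embeddings. Everything specific here is an assumption made for illustration and is not taken from the paper: the frame-stacking step, the single linear projection, the class name `ModalityAdapter`, the helper `stack_frames`, and the toy dimensions. The training objective (behavior alignment of continuation writing) is not shown.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames to shorten the sequence.

    Trailing frames that do not fill a full group of k are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


class ModalityAdapter:
    """Hypothetical adapter: frame stacking followed by one linear projection.

    In a real system only these parameters would be trained, while the
    speech encoder and the LLM stay frozen.
    """

    def __init__(self, enc_dim, llm_dim, k=2, seed=0):
        rng = random.Random(seed)
        self.k = k
        in_dim = enc_dim * k
        scale = in_dim ** -0.5
        # weight has shape (llm_dim, in_dim); bias has shape (llm_dim,)
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(llm_dim)
        ]
        self.bias = [0.0] * llm_dim

    def __call__(self, encoder_states):
        stacked = stack_frames(encoder_states, self.k)
        return [
            [
                sum(w * x for w, x in zip(row, frame)) + b
                for row, b in zip(self.weight, self.bias)
            ]
            for frame in stacked
        ]


# Toy stand-in for frozen encoder output: 7 frames of dimension 4.
encoder_states = [[float(t + d) for d in range(4)] for t in range(7)]

adapter = ModalityAdapter(enc_dim=4, llm_dim=6, k=2)
speech_embeddings = adapter(encoder_states)

# 7 frames stacked in pairs give 3 vectors, each projected to llm_dim = 6.
print(len(speech_embeddings), len(speech_embeddings[0]))
```

In this sketch the resulting vectors would be fed to the LLM in place of text token embeddings. Stacking frames is one common way to bring the speech sequence length closer to that of text, and the design actually used by BLSP may differ.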
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Adapter
