Seal: Advancing Speech Language Models to be Few-Shot Learners
Shuyu Lei, Lingen Liu, Jiaolong Yang, Yasen Jiao, Yuxiang Yang, Yushu Yang, Xiang Guo

TL;DR
Seal is a speech language model that extends few-shot, in-context learning to a multi-modal (speech and language) setting. It trains a projector with a Kullback-Leibler divergence loss to bridge a frozen speech encoder and a frozen language model decoder, and it shows robust few-shot performance on two speech understanding tasks.
Contribution
The paper introduces Seal, a speech language model whose alignment method uses a Kullback-Leibler divergence loss to train a projector connecting a frozen speech encoder to a frozen language model decoder, enabling few-shot learning on speech understanding tasks.
Findings
Seal performs robustly as a few-shot learner on two speech understanding tasks.
The alignment method improves cross-modal transfer and robustness.
Consistency experiments validate its robustness across different pre-trained language models.
Abstract
Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in the prompt, without requiring any additional training. To extend this capability to a multi-modal setting (i.e., speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.
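To make the alignment objective in the abstract concrete, the following is a minimal sketch of a Kullback-Leibler divergence loss between two next-token distributions. The abstract does not say which distributions are compared or in which direction, so this sketch assumes a "teacher" distribution from the frozen language model reading the text transcript and a "student" distribution from the same model reading projected speech features. The function names `softmax`, `kl_divergence`, and `alignment_loss` are hypothetical, and the projector and its gradient updates are not shown.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions.

    ASSUMPTION: teacher_logits come from the frozen language model given
    the text transcript, and student_logits from the same frozen model
    given projected speech features. The paper may define this differently.
    """
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at 2 positions (made-up values).
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [0.0, 0.4, 2.5]]
    print(alignment_loss(teacher, student))
```

In a setup like this, only the projector's parameters would receive gradients from the loss, while the speech encoder and language model stay frozen, as the abstract describes.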
Taxonomy
Topics: Natural Language Processing Techniques
