WavLLM: Towards Robust and Adaptive Speech Large Language Model

Shujie Hu; Long Zhou; Shujie Liu; Sanyuan Chen; Lingwei Meng; Hongkun Hao; Jing Pan; Xunying Liu; Jinyu Li; Sunit Sivasankaran; Linquan Liu; Furu Wei

arXiv:2404.00656 · cs.CL · September 24, 2024 · 2 cites

TL;DR

WavLLM is a novel speech large language model that integrates dual encoders and curriculum learning to achieve state-of-the-art performance across diverse speech tasks, demonstrating robustness and adaptability without task-specific training.

Contribution

The paper introduces WavLLM, a robust and adaptive speech LLM with dual encoders and a prompt-aware LoRA adapter, trained via curriculum learning for broad speech task generalization.

Findings

01. Achieves state-of-the-art results on multiple speech benchmarks.

02. Successfully completes Gaokao English listening tasks without specialized training.

03. Demonstrates robust generalization and complex task execution capabilities.

Abstract

The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its…
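The two architectural ideas named in the abstract, dual encoders and a prompt-aware LoRA adapter, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that the two encoders' per-frame features are combined by concatenation and that the prompt-aware adapter reduces to a single scalar that scales the low-rank update. The names `fuse_frames`, `lora_forward`, and `scale` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def fuse_frames(semantic: List[Vector], acoustic: List[Vector]) -> List[Vector]:
    """Concatenate per-frame features from two encoders.

    Assumption: both encoders emit the same number of frames, and
    concatenation is the fusion rule (the abstract does not specify one).
    """
    if len(semantic) != len(acoustic):
        raise ValueError("encoders must produce the same number of frames")
    return [s + a for s, a in zip(semantic, acoustic)]


def lora_forward(x: Vector, w: Matrix, a: Matrix, b: Matrix, scale: float) -> Vector:
    """Apply a frozen linear map plus a scaled low-rank (LoRA) update.

    Computes W x + scale * B (A x). Here `scale` is passed in directly;
    treating it as a prompt-dependent scalar is an illustrative assumption.
    """
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


# Toy example: one frame, 2-dim "semantic" and 1-dim "acoustic" features.
fused = fuse_frames([[1.0, 2.0]], [[3.0]])
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight (2 x 3)
a = [[0.0, 0.0, 1.0]]                    # LoRA down-projection (rank 1)
b = [[1.0], [2.0]]                       # LoRA up-projection
adapted = lora_forward(fused[0], w, a, b, scale=0.5)
frozen = lora_forward(fused[0], w, a, b, scale=0.0)
```

With `scale=0.0` the layer reduces to the frozen base map, so in this toy setup the scalar alone controls how much the low-rank update contributes for a given prompt.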

Code & Models

Repositories

microsoft/speecht5 (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Sparse Evolutionary Training · Adapter