SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning

Chien-yu Huang; Min-Han Shih; Ke-Han Lu; Chi-Yuan Hsiao; Hung-yi Lee

arXiv:2408.13891 · cs.CL · August 27, 2024

TL;DR

SpeechCaps introduces a multi-talker speaking style captioning task to improve instruction-based speech models, leveraging large language models for data generation and demonstrating enhanced performance in speaker and emotion recognition.

Contribution

The paper proposes a novel multi-talker speaking style captioning task and a training pipeline combining pre-training and instruction tuning for universal speech models.

Findings

01 Outperforms the baseline pre-trained only on single-talker tasks on Dynamic-SUPERB, particularly in speaker and emotion recognition

02 Enhances understanding of speaker and prosodic information

03 On a multi-talker QA task, current models struggle with attributes such as gender, pitch, and speaking rate

Abstract

Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. Thus, it is highly desirable to design a fundamental task that benefits other downstream tasks. This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information. We used large language models to generate descriptions for multi-talker speech. Then, we trained our model with pre-training on this captioning task followed by instruction tuning. Evaluation on Dynamic-SUPERB shows our model outperforming the baseline pre-trained only on single-talker tasks, particularly in speaker and emotion recognition. Additionally, tests on a multi-talker QA task reveal that current models struggle with attributes such as gender, pitch, and speaking…
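The abstract says descriptions for multi-talker speech were generated with large language models, but it does not give the prompt or the metadata format. As a rough illustration only, the sketch below shows one way per-talker style labels could be assembled into a text prompt for such a model. The `TalkerStyle` class, its field names, the `build_caption_prompt` helper, and the prompt wording are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TalkerStyle:
    """Hypothetical per-talker style labels; the field names are assumptions."""
    gender: str
    pitch: str
    speaking_rate: str
    emotion: str
    transcript: str


def build_caption_prompt(talkers: List[TalkerStyle]) -> str:
    """Assemble a text prompt asking an LLM to describe a multi-talker clip.

    The wording is illustrative, not the prompt used by the authors.
    """
    lines = [
        "Write one fluent description of the following multi-talker speech clip.",
        "Mention each talker's speaking style in the order they speak.",
        "",
    ]
    # One metadata line per talker, numbered in speaking order.
    for index, talker in enumerate(talkers, start=1):
        lines.append(
            f"Talker {index}: gender={talker.gender}, pitch={talker.pitch}, "
            f"speaking_rate={talker.speaking_rate}, emotion={talker.emotion}, "
            f'text="{talker.transcript}"'
        )
    return "\n".join(lines)


# Made-up labels for a two-talker clip, used only to exercise the helper.
example_prompt = build_caption_prompt([
    TalkerStyle("female", "high", "fast", "happy", "Did you see the results?"),
    TalkerStyle("male", "low", "slow", "neutral", "Not yet, I will check later."),
])
```

The resulting string would be sent to whichever LLM is used for caption generation; the returned description would then serve as the target text for the captioning pre-training step mentioned in the abstract.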

Code & Models

Repositories

cyhuang-tw/speechcaps (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis