Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

Yu-Kuan Fu; Cheng-Kuang Lee; Hsiu-Hsuan Wang; Hung-yi Lee

arXiv:2407.01911 · cs.CL · July 3, 2024

Open Access

TL;DR

This paper introduces a pipeline to generate pseudo-stereo data from single-channel speech, greatly expanding training datasets for spoken dialogue models, and evaluates the impact of different speech foundation models on dialogue generation performance.

Contribution

We developed a pipeline that converts single-channel dialogue speech into pseudo-stereo data, expanding the training set from 2,000 to 17,600 hours and improving the performance of spoken dialogue language models.

Findings

01 Pseudo-stereo data improves the performance of spoken dialogue language models

02 Training data grew from 2,000 to 17,600 hours

03 The choice of speech foundation model affects dialogue generation quality

Abstract

Recent efforts in Spoken Dialogue Modeling aim to synthesize spoken dialogue without the need for direct transcription, thereby preserving the wealth of non-textual information inherent in speech. However, this approach faces a challenge when speakers talk simultaneously, requiring stereo dialogue data with speakers recorded on separate channels, a notably scarce resource. To address this, we have developed an innovative pipeline capable of transforming single-channel dialogue data into pseudo-stereo data. This expanded our training dataset from a mere 2,000 to an impressive 17,600 hours, significantly enriching the diversity and quality of the training examples available. The inclusion of this pseudo-stereo data has proven to be effective in improving the performance of spoken dialogue language models. Additionally, we explored the use of discrete units of different speech foundation…
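To make the term "pseudo-stereo" concrete, here is a minimal sketch of the target data layout, not the authors' pipeline. It assumes that speaker turns are already known as sample ranges (the hypothetical `Turn` tuples below) and that exactly two speakers are present; the function name `to_pseudo_stereo` is also an illustration only. It does not handle overlapping speech, which a real pipeline would have to resolve before routing audio to channels.

```python
from typing import List, Tuple

# Hypothetical turn annotation: (speaker_id, start_sample, end_sample).
# In practice these would come from some upstream speaker-labeling step.
Turn = Tuple[int, int, int]


def to_pseudo_stereo(
    mono: List[float], turns: List[Turn]
) -> Tuple[List[float], List[float]]:
    """Route each speaker's turns to its own channel, leaving silence elsewhere."""
    left = [0.0] * len(mono)
    right = [0.0] * len(mono)
    for speaker, start, end in turns:
        if speaker not in (0, 1):
            raise ValueError("this sketch assumes two speakers, labeled 0 and 1")
        channel = left if speaker == 0 else right
        # Clamp to the signal bounds so bad annotations cannot index out of range.
        start = max(0, start)
        end = min(len(mono), end)
        if start < end:
            channel[start:end] = mono[start:end]
    return left, right


# Toy example: six samples, speaker 0 talks first and last, speaker 1 in between.
mono = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
turns = [(0, 0, 2), (1, 2, 5), (0, 5, 6)]
left, right = to_pseudo_stereo(mono, turns)
```

Each output channel has the same length as the input, so the two can be written out as one two-channel recording in which each speaker occupies a separate channel, the format the abstract describes as scarce.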

Taxonomy

Topics: Robotics and Automated Systems · Speech and dialogue systems · Diverse Interdisciplinary Research Innovations