SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis

Osamu Take; Shinnosuke Takamichi; Kentaro Seki; Yoshiaki Bando; Hiroshi Saruwatari

arXiv:2408.06858 · eess.AS · August 14, 2024 · Interspeech

TL;DR

SaSLaW is a novel speech corpus capturing audio-visual egocentric data in dialogues, enabling the development of environment-adaptive speech synthesis models that produce more natural speech tailored to diverse audio settings.

Contribution

The paper introduces SaSLaW, a new dialogue speech corpus with synchronized audio-visual data, and demonstrates its use in training models that adapt speech to different environments.

Findings

01 Text-to-speech models that incorporate hearing-audio data (what the speaker hears) produce more plausible, environment-adapted speech.

02 SaSLaW enables analysis of how human speakers adjust their speech in diverse audio-visual contexts.

03 Experimental results show improved speech naturalness with environment-aware models.

Abstract

This paper presents SaSLaW, a spontaneous dialogue speech corpus containing synchronous recordings of what speakers speak, listen to, and watch. Humans consider the diverse environmental factors and then control the features of their utterances in face-to-face voice communications. Spoken dialogue systems capable of this adaptation to these audio environments enable natural and seamless communications. SaSLaW was developed to model human-speech adjustment for audio environments via first-person audio-visual perceptions in spontaneous dialogues. We propose the construction methodology of SaSLaW and display the analysis result of the corpus. We additionally conducted an experiment to develop text-to-speech models using SaSLaW and evaluate their performance of adaptations to audio environments. The results indicate that models incorporating hearing-audio data output more plausible speech…

Code & Models

Repositories

sarulab-speech/saslaw
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media