SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis
Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari

TL;DR
SaSLaW is a novel speech corpus capturing audio-visual egocentric data in dialogues, enabling the development of environment-adaptive speech synthesis models that produce more natural speech tailored to diverse audio settings.
Contribution
The paper introduces SaSLaW, a new dialogue speech corpus with synchronized audio-visual data, and demonstrates its use in training models that adapt speech to different environments.
Findings
Text-to-speech models that incorporate hearing-audio data produce more plausible, environment-adapted speech.
SaSLaW enables analysis of human speech adjustments in diverse audio-visual contexts.
Experimental results show improved speech naturalness with environment-aware models.
Abstract
This paper presents SaSLaW, a spontaneous dialogue speech corpus containing synchronous recordings of what speakers speak, listen to, and watch. Humans consider the diverse environmental factors and then control the features of their utterances in face-to-face voice communications. Spoken dialogue systems capable of this adaptation to these audio environments enable natural and seamless communications. SaSLaW was developed to model human-speech adjustment for audio environments via first-person audio-visual perceptions in spontaneous dialogues. We propose the construction methodology of SaSLaW and display the analysis result of the corpus. We additionally conducted an experiment to develop text-to-speech models using SaSLaW and evaluate their performance of adaptations to audio environments. The results indicate that models incorporating hearing-audio data output more plausible speech…
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
