Towards a Japanese Full-duplex Spoken Dialogue System
Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka

TL;DR
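This paper introduces the first publicly available Japanese full-duplex spoken dialogue model. It is built on Moshi, an English full-duplex dialogue model, pre-trained on large-scale Japanese spoken dialogue data, fine-tuned on high-quality stereo dialogue data, and augmented with synthetic dialogues. The authors report improved naturalness and meaningfulness over baseline models.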
Contribution
The paper presents the first publicly available Japanese full-duplex spoken dialogue model, built on Moshi and trained in two stages (pre-training, then fine-tuning), with synthetic dialogue data used as augmentation.
Findings
Outperforms baseline models in naturalness
Outperforms baseline models in meaningfulness
Synthetic dialogue data from a multi-stream text-to-speech system further improves performance
Abstract
Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversation such as speech overlaps and backchannels, have attracted significant attention recently. However, research on full-duplex spoken dialogue systems for Japanese remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, which is built upon Moshi, a full-duplex dialogue model in English. Our model is trained through a two-stage process: pre-training on large-scale spoken dialogue data in Japanese, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that…
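The abstract's emphasis on stereo data comes down to each speaker having a separate channel, so that overlapping speech is preserved rather than flattened into turns. The sketch below is illustrative only and is not taken from the paper: it assumes each channel has already been reduced to a sorted list of (start, end) speech segments in seconds, and the function name overlap_regions and the sample timings are hypothetical.

```python
from typing import List, Tuple

# A speech segment on one channel: (start_sec, end_sec).
Segment = Tuple[float, float]


def overlap_regions(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Return the time spans where both speakers are talking at once.

    Assumes each list is sorted by start time and that segments within
    a single list do not overlap each other.
    """
    out: List[Segment] = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Intersection of the two current segments, if any.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical timings: speaker B gives a short backchannel while A is
# still speaking, then starts a turn just before A finishes.
speaker_a = [(0.0, 2.5), (4.0, 6.0)]
speaker_b = [(2.0, 2.4), (5.5, 7.0)]
overlaps = overlap_regions(speaker_a, speaker_b)
```

With a single mixed channel, these overlapping spans could not be attributed to a speaker, which is one reason separate per-speaker streams matter for full-duplex modelling.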
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation
Methods: Softmax · Attention Is All You Need
