J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Wataru Nakata; Kentaro Seki; Hitomi Yanaka; Yuki Saito; Shinnosuke Takamichi; Hiroshi Saruwatari

arXiv:2407.15828 · cs.CL · April 3, 2026

4 models · 1 dataset

TL;DR

J-CHAT is a large-scale, high-quality Japanese spoken dialogue corpus designed to improve spoken dialogue systems by providing diverse, spontaneous, and acoustically clean data for training advanced models.

Contribution

The paper introduces J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus built with an automated, language-independent methodology to support the development of spoken dialogue systems (SDSs).

Findings

01. Generative models trained on J-CHAT show improved dialogue quality.

02. J-CHAT's diversity and spontaneity benefit spoken dialogue system training.

03. The corpus's quality filtering enhances model performance.

Abstract

Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing datasets are often limited in size, spontaneity, or linguistic coherence. To address these limitations, we introduce J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus. Constructed using an automated, language-independent methodology, J-CHAT ensures acoustic cleanliness, diversity, and natural spontaneity. The corpus is built from YouTube and podcast data, with extensive filtering and denoising to enhance quality. Experimental results with generative spoken dialogue language models trained on J-CHAT demonstrate its effectiveness for SDS development. By providing a robust foundation for training advanced dialogue models, we…

Code & Models

Models

Datasets

sarulab-speech/J-CHAT
dataset · 183 downloads