TL;DR
This paper introduces the first open, reproducible full-duplex spoken dialogue system for Hindi, trained on 26,000 hours of real spontaneous conversations, bringing natural turn-taking and overlaps to Indian-language dialogue systems.
Contribution
It adapts Moshi, a state-of-the-art duplex speech architecture, to Hindi by using a custom Hindi tokeniser, a large conversational dataset with separate speaker channels, and a two-stage training recipe to model natural conversational behaviours.
Findings
The model generates natural and meaningful Hindi conversations.
Evaluation with automatic and human metrics shows improved dialogue continuation.
It is the first open-source full-duplex system for Hindi trained on real-world data.
Abstract
Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational…
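The tokeniser swap described in the abstract can be illustrated with a minimal sketch. It assumes a checkpoint represented as a plain dictionary of weight matrices; the parameter names (`text_emb.weight`, `text_out.weight`, `audio_emb.weight`), the toy sizes, and the 0.02 initialisation scale are all hypothetical and are not taken from Moshi or from the paper. The point is only that tensors whose shape depends on the text vocabulary are recreated, while the audio weights are carried over unchanged.

```python
import random


def reinit_text_params(state, text_keys, new_vocab_size, dim, seed=0):
    """Return a new state dict in which text-vocabulary-dependent matrices
    are recreated for the new vocabulary and every other entry is kept."""
    rng = random.Random(seed)
    new_state = {}
    for name, weights in state.items():
        if name in text_keys:
            # The row count follows the vocabulary size, so the old rows
            # cannot be reused; draw fresh small random values (assumed scale).
            new_state[name] = [
                [rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(new_vocab_size)
            ]
        else:
            # Vocabulary-independent weights (e.g. audio) are retained as-is.
            new_state[name] = weights
    return new_state


# Toy checkpoint with hypothetical names and tiny sizes, for illustration only.
dim = 4
old_state = {
    "text_emb.weight": [[0.1] * dim for _ in range(8)],    # old text vocab: 8
    "text_out.weight": [[0.2] * dim for _ in range(8)],
    "audio_emb.weight": [[0.3] * dim for _ in range(16)],  # audio entries: 16
}

new_state = reinit_text_params(
    old_state,
    text_keys={"text_emb.weight", "text_out.weight"},
    new_vocab_size=12,  # hypothetical size of the new Hindi vocabulary
    dim=dim,
)
```

In a real framework the same idea would be applied to the model's actual embedding and output-projection tensors; which parameters count as text-vocabulary-dependent is determined by the architecture, not by this sketch.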
