A Dataset for Building Code-Mixed Goal Oriented Conversation Systems
Suman Banerjee, Nikita Moghe, Siddhartha Arora, Mitesh M. Khapra

TL;DR
This paper introduces a new dataset of goal-oriented, code-mixed conversations in multiple Indian languages to support the development of multilingual dialogue systems, filling a significant gap in existing resources.
Contribution
The authors create and release a novel multilingual, code-mixed dialogue dataset based on DSTC2, along with baseline models for building such systems.
Findings
The dataset covers Hindi-English, Bengali-English, Gujarati-English, and Tamil-English conversations.
Baseline models provide initial performance figures on code-mixed dialogue tasks.
The dataset is publicly available, which facilitates future research in multilingual conversational AI.
Abstract
There is an increasing demand for goal-oriented conversation systems which can assist users in various day-to-day activities such as booking tickets, restaurant reservations, shopping, etc. Most of the existing datasets for building such conversation systems focus on monolingual conversations and there is hardly any work on multilingual and/or code-mixed conversations. Such datasets and systems thus do not cater to the multilingual regions of the world, such as India, where it is very common for people to speak more than one language and seamlessly switch between them resulting in code-mixed conversations. For example, a Hindi speaking user looking to book a restaurant would typically ask, "Kya tum is restaurant mein ek table book karne mein meri help karoge?" ("Can you help me in booking a table at this restaurant?"). To facilitate the development of such code-mixed conversation…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
