Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents

Andrew H. Lee; Sina J. Semnani; Galo Castillo-López; Gaël de Chalendar; Monojit Choudhury; Ashna Dua; Kapil Rajesh Kavitha; Sungkyun Kim; Prashant Kodali; Ponnurangam Kumaraguru; Alexis Lombard; Mehrad Moradshahi; Gihyun Park; Nasredine Semmar; Jiwon Seo; Tianhao Shen; Manish Shrivastava; Deyi Xiong; Monica S. Lam

arXiv:2405.17840·cs.CL·June 18, 2024

TL;DR

This paper demonstrates that in-context learning with GPT-4 can effectively handle multilingual dialogue tasks, but current benchmarks underestimate its true performance due to annotation errors and metric limitations.

Contribution

It introduces a novel approach using in-context learning for multilingual dialogue state tracking and response generation, revealing that benchmarks undervalue its effectiveness.

Findings

01 GPT-4 achieves 89.6% to 96.8% DST accuracy after annotation errors are corrected

02 Response generation correctness exceeds 99% with improved evaluation

03 Current benchmarks significantly underestimate in-context learning capabilities

Abstract

Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show for the first time that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which range from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly…
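As a rough illustration of the kind of metric the abstract refers to, the sketch below scores turn-by-turn DST as an exact match between predicted and gold states. The flat `domain.slot` state representation, the `normalize` helper, and the example slot names are assumptions made for illustration; they are not taken from the paper or from the X-RiSAWOZ schema.

```python
from typing import Dict, List

# Assumption: a dialogue state is a flat mapping from "domain.slot" to a value.
# This is an illustrative representation, not the dataset's actual schema.
State = Dict[str, str]


def normalize(state: State) -> State:
    """Lowercase and strip keys and values so trivial formatting differences do not count as errors."""
    return {k.strip().lower(): v.strip().lower() for k, v in state.items()}


def turn_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Return the fraction of turns whose predicted state exactly matches the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Hypothetical two-turn dialogue: the first turn differs only in casing,
# the second turn has a genuinely different slot value.
gold = [
    {"hotel.area": "Suzhou"},
    {"hotel.area": "Suzhou", "hotel.price": "cheap"},
]
predicted = [
    {"hotel.area": "suzhou"},
    {"hotel.area": "Suzhou", "hotel.price": "moderate"},
]

print(turn_accuracy(predicted, gold))  # 0.5
```

Under an exact-match metric like this one, a wrong gold label turns a correct prediction into a counted miss, which is one way annotation errors can make measured accuracy understate real performance.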

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Topic Modeling

Methods: Attention Is All You Need · Dynamic Sparse Training · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout