PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Mohammad Rifqi Farhansyah; Hanif Muhammad Zhafran; Farid Adilazuarda; Shamsuddeen Hassan Muhammad; Maryam Ibrahim Mukhtar; Nedjma Ousidhoum; Genta Indra Winata; Ayu Purwarianti; Alham Fikri Aji

arXiv:2601.17277 · cs.CL · January 27, 2026

TL;DR

PingPong is a benchmark of natural, human-authored, multi-turn, multi-party code-switching dialogues covering five language-combination variations, some of them trilingual. It highlights how current NLP models struggle to understand complex multilingual conversations.

Contribution

The paper presents PingPong, a novel dataset capturing authentic, multi-party code-switching dialogues with diverse structures and tasks, filling a gap in existing benchmarks.

Findings

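01 Current models perform poorly on code-switched dialogues

02 The dataset is more natural and structurally diverse than machine-generated data, with greater variation in message length, speaker dominance, and reply distance

03 Three downstream tasks (Question Answering, Dialogue Summarization, and Topic Classification) demonstrate the complexity of real-world multilingual conversations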

Abstract

Code-switching is a widespread practice among the world's multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of which are trilingual. Our dataset consists of human-authored conversations among 2 to 4 participants covering authentic, multi-threaded structures where replies frequently reference much earlier points in the dialogue. We demonstrate that our data is significantly more natural and structurally diverse than machine-generated alternatives, offering greater variation in message length, speaker dominance, and reply distance. Based on these dialogues, we define three downstream tasks: Question Answering, Dialogue Summarization, and Topic Classification. Evaluations of several state-of-the-art…

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques