DialSim: A Dialogue Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yeonsu Kwon, Yohan Jo, Edward Choi

TL;DR
DialSim is a novel dialogue simulation framework designed to evaluate conversational agents in complex, long-term multi-party scenarios, revealing current models' limitations in understanding extended dialogue contexts.
Contribution
The paper introduces DialSim, a new evaluation framework, and LongDialQA, a dataset for assessing multi-party dialogue understanding in conversational AI.
Findings
State-of-the-art LLMs struggle with long-term, multi-party dialogue comprehension.
DialSim reveals limitations of large context window models in maintaining accurate understanding.
The framework highlights the need for more realistic benchmarks in conversational AI evaluation.
Abstract
Recent advancements in Large Language Models (LLMs) have significantly enhanced conversational agents, making them applicable to various fields (e.g., education, entertainment). Despite their progress, the evaluation of the agents often overlooks the complexities of real-world conversations, such as multi-party dialogues and extended contextual dependencies. To bridge this gap, we introduce DialSim, a dialogue simulation-based evaluation framework. In DialSim, an agent assumes the role of a character in a scripted conversation and is evaluated on their ability to answer spontaneous questions using only the dialogue history, while recognizing when they lack sufficient information. To support this framework, we introduce LongDialQA, a new QA dataset constructed from long-running TV shows, comprising over 1,300 dialogue sessions, each paired with more than 1,000 carefully curated…
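The evaluation idea described in the abstract (answer from the dialogue history only, and abstain when the history lacks the evidence) can be illustrated with a small scoring sketch. This is not the authors' code: the `Question` class, the `score` function, the `abstention_recall` metric name, and the exact-match comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical abstention label; the real benchmark's wording may differ.
IDK = "I don't know"


@dataclass
class Question:
    text: str
    choices: list[str]
    answer: str  # gold choice; IDK when the dialogue history lacks the evidence


def score(predictions: list[str], questions: list[Question]) -> dict[str, float]:
    """Return overall accuracy and the share of unanswerable questions
    on which the agent correctly abstained (illustrative metrics only)."""
    if len(predictions) != len(questions):
        raise ValueError("predictions and questions must have the same length")

    pairs = list(zip(predictions, questions))
    correct = sum(pred == q.answer for pred, q in pairs)

    # Unanswerable questions are those whose gold answer is the abstention label.
    unanswerable = [(pred, q) for pred, q in pairs if q.answer == IDK]
    abstained = sum(pred == IDK for pred, _ in unanswerable)

    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "abstention_recall": abstained / len(unanswerable) if unanswerable else 0.0,
    }


# Toy data, invented for this example.
questions = [
    Question("Who borrowed the car?", ["Alice", "Bob", IDK], "Alice"),
    Question("Where is the meeting?", ["Cafe", "Office", IDK], IDK),
    Question("When did Bob move?", ["2019", "2021", IDK], IDK),
]
predictions = ["Alice", IDK, "2021"]
result = score(predictions, questions)
```

Reporting abstention separately from overall accuracy keeps an agent that always answers from being indistinguishable from one that recognizes missing information.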
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper presents an effective simulation-based evaluation paradigm for long-term multi-party dialogue understanding, with specially designed questions and answer choices (including an "I don't know" choice).
2. The paper provides a large-scale, multi-party, long-horizon dataset that spans five seasons across three shows, with 1300+ sessions, ~352K tokens, and 1000+ curated questions per session, surpassing prior two-party, shorter datasets. Empirical results show that all models score under 60% on
1. The dataset is built from three very popular entertainment-focused TV shows, which limits its generalizability across domains and applications in real-world scenarios.
2. Even with anonymization or name swapping, prior knowledge is not fully eliminated. Models can still leverage memorized show knowledge. Swapping names is not a robust and effective adversarial strategy because it may create contradictions with memorized character attributes rather than blocking access to prior facts. Consid
1. Realistic long-horizon, multi-party setup with abstention. The simulator blends long context, multiple speakers, and uncertainty (“I don’t know”).
2. Careful dataset construction and controls. The combination of fan-quiz evidence mapping, TKG-driven multi-hop questions, character style transfer, and anonymization/adversarial name swapping is thoughtful
1. Retrieval baselines feel narrow. RAG uses BM25 and vendor embeddings; there are no dense retrievers tuned for dialogue, no temporal/speaker-aware indexing, and no hierarchical retrieval. Given the oracle gap (+10–30%), richer retrievers could materially change conclusions. A few options to consider: Contriever (unsupervised dense IR), ColBERTv2 (late-interaction), and the E5 family (text-embedding models for retrieval).
2. Question timing and asker selection may distort conversational realism. The schedule
- The paper's focus on long, multi-party chats is a setup that is needed and is usable in real-world evaluation.
- I like the inclusion of questions that are not answerable. Making the models learn when to say "I don't know" is a separate and very interesting research direction in itself.
- The paper is very well written and easy to follow.
- Although the authors anonymize the chats by changing or removing speaker names from dialogues, the models could still have prior knowledge about the sitcoms by analyzing the context of the chat. Moreover, the use of TV show scripts does not reflect real-world conversations well, which are much messier. Although DialSim is an interesting evaluation setup, testing it on other real-world data (such as meeting notes) could say much more about its efficiency.
- The oracle test shows higher sco
Taxonomy
Topics: AI in Service Interactions · Speech and dialogue systems · Social Robot Interaction and HRI
