SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Anticipation?
Yueyi Yang, Haotian Liu, Fang Kang, Mengqi Zhang, Zheng Lian, Hao Tang, Haoyu Chen

TL;DR
This paper introduces SayNext-Bench, a benchmark and dataset for evaluating large language models on next-utterance anticipation in dialogue, highlighting the importance of multimodal cues and proposing a new cognitively inspired model.
Contribution
The paper presents a new benchmark, a dataset, and a dual-route multimodal large language model (MLLM) that incorporates perceptual cues, advancing the understanding of multimodal anticipation in dialogue models.
Findings
SayNext-Chat outperforms state-of-the-art MLLMs across evaluation levels.
Multimodal cues significantly improve next-utterance anticipation.
Active anticipatory processing is crucial for natural, human-like dialogue.
Abstract
We explore the use of large language models (LLMs) for next-utterance anticipation in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to anticipate a human speaker's next utterance. In contrast, humans can readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone. To systematically examine this gap, we propose SayNext-Bench, a benchmark that evaluates multimodal LLMs (MLLMs) on anticipating context-conditioned responses across diverse real-world scenarios. To support it, we build SayNext-PC, a large-scale multimodal dialogue dataset, and carefully design a multi-level evaluation framework spanning lexical similarity, emotion-intention consistency, and LLM-based overall alignment. Building on this, we develop…
