TL;DR
This paper introduces a multimodal approach for large language models to better determine when to speak during conversations, using synchronized video, audio, and text cues to improve timing accuracy.
Contribution
It presents a novel multimodal strategy and dataset for enhancing LLMs' conversational timing awareness, significantly improving response type prediction performance.
Findings
Up to 3x improvement in response type prediction accuracy.
Multimodal perception enhances conversational naturalness.
A new dataset with aligned modalities and reaction annotations.
Abstract
Chatbots built on large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. To support this task, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong…
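The dense response-type formulation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the three-way `Action` label set collapses the finer reaction categories into one class, the window length and stride are arbitrary, and the per-window scores stand in for the output of a hypothetical multimodal model.

```python
from enum import Enum
from typing import Iterator, List, Sequence, Tuple


class Action(Enum):
    """Coarse decisions named in the abstract (hypothetical encoding)."""
    SILENT = 0
    REACTION = 1
    FULL_RESPONSE = 2


def sliding_windows(n_frames: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) frame spans that fit entirely inside the stream."""
    start = 0
    while start + window <= n_frames:
        yield start, start + window
        start += stride


def decide(scores: Sequence[float]) -> Action:
    """Pick the highest-scoring action for one window (plain argmax)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return Action(best)


def dense_predictions(
    per_window_scores: Sequence[Sequence[float]],
    n_frames: int,
    window: int,
    stride: int,
) -> List[Tuple[int, int, Action]]:
    """Attach one decision to every window, giving a dense label sequence."""
    spans = list(sliding_windows(n_frames, window, stride))
    if len(spans) != len(per_window_scores):
        raise ValueError("expected exactly one score vector per window")
    return [(s, e, decide(sc)) for (s, e), sc in zip(spans, per_window_scores)]


if __name__ == "__main__":
    # Made-up scores for a 10-frame stream, window of 4 frames, stride of 2.
    scores = [
        [0.90, 0.05, 0.05],
        [0.20, 0.70, 0.10],
        [0.10, 0.20, 0.70],
        [0.80, 0.10, 0.10],
    ]
    for start, end, action in dense_predictions(scores, n_frames=10, window=4, stride=2):
        print(f"frames {start}-{end}: {action.name}")
```

In a streaming setting the same decision would be made as each new window arrives rather than over a precomputed list; the batch form is used here only to keep the sketch short.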
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper introduces a new task that predicts when a conversation model should speak by jointly leveraging text, audio, and video modalities. 2. The authors build a new dataset by collecting real dyadic conversation videos from YouTube and annotating each segment with nine response types (seven short reaction categories, full response, and silence). This dataset is specifically designed to support fine-grained modeling of response timing in natural conversations. 3. The proposed MM-When2Speak
1. Although the collected dyadic conversation dataset is carefully curated and of high quality, its overall scale is rather limited. In addition, the data are confined to dyadic interactions, leaving more complex multi-party conversations unexplored. Even if the model is trained on dyadic exchanges, evaluating it on multi-party dialogue (small sample like full-video split) could provide meaningful insights into its generalization capability. 2. The model addresses only when to speak, not how to
Targeting the underexplored yet vital problem of “when to speak” (vs. “what to say”) in human–AI interaction, the paper reports consistent gains across datasets, modalities, and strong baselines with informative ablations, adopts sliding-window inference for online deployability, and documents transparent data construction and labeling (including confusion matrices and class-wise recall).
1. The pipeline figure is ambiguous—window length/stride and label–time alignment are unspecified, the transcript appears stretched across multiple audio windows, and it’s unclear whether outputs are per-window predictions or post-processed events. 2. Lacks qualitative case analyses and a breakdown of reaction/query diversity (per-class distribution, long-tail behavior, representative successes/failures). 3. No end-to-end latency or memory profiling in realistic settings; please quantify per 10-
1. The problem is practical and underexplored: most dialogue systems focus on what to say but rarely model when they should respond or remain silent. 2. The dataset is a useful contribution: multimodal, time-aligned, and supported by high-quality human-verified annotations. 3. The method is straightforward and computationally lightweight, making real-time deployment feasible. 4. Experimental results indicate clear improvements over strong multimodal LLM baselines across multiple settings.
1. One concern is the practical usefulness of the proposed model in real applications. Since it only outputs a response type rather than the actual content, it seems more like an auxiliary module that provides a triggering signal for another system to generate the verbal response. The paper does not discuss how this module integrates with a full conversational pipeline. 2. The evaluation does not report latency or real-time efficiency. Given that the model is intended for online interaction, en
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies
Methods: Focus
