Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao; Yi Ouyang; Yi-Lun Lee; Chen-Ping Yu; Yi-Hsuan Tsai; Zhaozheng Yin

arXiv:2505.14654 · cs.CV · May 21, 2026

3 Reviews

TL;DR

This paper introduces a multimodal approach for large language models to better determine when to speak during conversations, using synchronized video, audio, and text cues to improve timing accuracy.

Contribution

It presents a novel multimodal strategy and dataset for enhancing LLMs' conversational timing awareness, significantly improving response type prediction performance.

Findings

01. Up to 3x improvement in response type prediction accuracy.
02. Multimodal perception enhances conversational naturalness.
03. A new dataset with aligned modalities and reaction annotations.

Abstract

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong…
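The dense response-type formulation described in the abstract can be pictured as a per-window decision over a stream. The following is a minimal sketch, not the authors' implementation: the three coarse labels come from the abstract, while the window size, the stride, the scalar "activity" stream, and the `toy_score` thresholds are hypothetical placeholders chosen only to make the example runnable.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, Tuple

# Coarse label set named in the abstract; finer reaction types are not modeled here.
LABELS = ("silent", "short_reaction", "full_response")


def sliding_windows(
    stream: Iterable[float], size: int, stride: int
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (end_index, window) every `stride` items once `size` items are buffered."""
    buf: Deque[float] = deque(maxlen=size)
    for i, item in enumerate(stream):
        buf.append(item)
        if len(buf) == size and (i + 1 - size) % stride == 0:
            yield i, list(buf)


def decide(
    stream: Iterable[float],
    score_fn: Callable[[Sequence[float]], Sequence[float]],
    size: int = 4,
    stride: int = 2,
) -> List[Tuple[int, str]]:
    """Return one (end_index, label) decision per window, using only past items."""
    decisions = []
    for end, window in sliding_windows(stream, size, stride):
        scores = score_fn(window)
        best = max(range(len(LABELS)), key=lambda k: scores[k])
        decisions.append((end, LABELS[best]))
    return decisions


def toy_score(window: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained model: thresholds on mean activity."""
    mean = sum(window) / len(window)
    if mean >= 0.65:   # assumed: the other speaker is clearly active
        return [1.0, 0.0, 0.0]
    if mean >= 0.25:   # assumed: activity is tapering off
        return [0.0, 1.0, 0.0]
    return [0.0, 0.0, 1.0]  # assumed: the other speaker has stopped


# Made-up activity values, one per time step.
demo_stream = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
demo_decisions = decide(demo_stream, toy_score, size=4, stride=2)
```

In a real system `score_fn` would wrap a multimodal model that consumes aligned video, audio, and text for each window. The sketch only shows the streaming constraint: each decision is made from items that have already arrived.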

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper introduces a new task that predicts when a conversation model should speak by jointly leveraging text, audio, and video modalities.
2. The authors build a new dataset by collecting real dyadic conversation videos from YouTube and annotating each segment with nine response types (seven short reaction categories, full response, and silence). This dataset is specifically designed to support fine-grained modeling of response timing in natural conversations.
3. The proposed MM-When2Speak

Weaknesses

1. Although the collected dyadic conversation dataset is carefully curated and of high quality, its overall scale is rather limited. In addition, the data are confined to dyadic interactions, leaving more complex multi-party conversations unexplored. Even if the model is trained on dyadic exchanges, evaluating it on multi-party dialogue (even on a small sample, like the full-video split) could provide meaningful insights into its generalization capability.
2. The model addresses only when to speak, not how to

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Targeting the underexplored yet vital problem of “when to speak” (vs. “what to say”) in human–AI interaction, the paper reports consistent gains across datasets, modalities, and strong baselines with informative ablations, adopts sliding-window inference for online deployability, and documents transparent data construction and labeling (including confusion matrices and class-wise recall).

Weaknesses

1. The pipeline figure is ambiguous: window length/stride and label–time alignment are unspecified, the transcript appears stretched across multiple audio windows, and it’s unclear whether outputs are per-window predictions or post-processed events.
2. Lacks qualitative case analyses and a breakdown of reaction/query diversity (per-class distribution, long-tail behavior, representative successes/failures).
3. No end-to-end latency or memory profiling in realistic settings; please quantify per 10-

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The problem is practical and underexplored: most dialogue systems focus on what to say but rarely model when they should respond or remain silent.
2. The dataset is a useful contribution: multimodal, time-aligned, and supported by high-quality human-verified annotations.
3. The method is straightforward and computationally lightweight, making real-time deployment feasible.
4. Experimental results indicate clear improvements over strong multimodal LLM baselines across multiple settings.

Weaknesses

1. One concern is the practical usefulness of the proposed model in real applications. Since it only outputs a response type rather than the actual content, it seems more like an auxiliary module that provides a triggering signal for another system to generate the verbal response. The paper does not discuss how this module integrates with a full conversational pipeline.
2. The evaluation does not report latency or real-time efficiency. Given that the model is intended for online interaction, en

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies

MethodsFocus