Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Zhaoyan Pan; Hengyang Zhou; Xiangdong Li; Yuning Wang; Ye Lou; Jiatong Pan; Ji Zhou; Wei Zhang

arXiv:2604.25618·cs.MM·April 29, 2026

TL;DR

This paper introduces CUCI-Net, a model for conversational multimodal understanding that abstracts the dependency between the dialogue context and the current utterance into an explicit interpretation cue, then uses that cue to condition the final prediction.

Contribution

The paper proposes CUCI-Net, which keeps context and utterance structurally distinct during encoding, abstracts their dependency into an explicit interpretation cue, and integrates that cue into the final multimodal interaction stage.

Findings

01

CUCI-Net outperforms existing methods on benchmark datasets.

02

Explicit cue modeling enhances context-dependent multimodal understanding.

03

Combining local modality evidence with global contextual evidence yields effective interpretation cues.

Abstract

Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on…
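The three stages named in the abstract (separate encoding, cue abstraction, cue-guided interaction) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function names `interpretation_cue` and `cue_guided_fusion`, the dot-product attention used for local evidence, the mean pooling used for global evidence, the equal-weight average, and the sigmoid gate are all assumptions chosen only to make the data flow concrete. Features are assumed to be already encoded, with context and utterance kept in separate containers.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def interpretation_cue(context, utterance):
    """Abstract the context-utterance dependency into one cue vector.

    context:   dict modality -> (T, d) array, one row per preceding turn
    utterance: dict modality -> (d,) array for the current utterance
    Both the attention and the pooling below are hypothetical choices.
    """
    local = []
    for m, u in utterance.items():
        # Local modality evidence: within one modality, the utterance
        # attends over the context turns (scaled dot-product attention).
        scores = context[m] @ u / np.sqrt(u.size)
        local.append(softmax(scores) @ context[m])
    local_evidence = np.mean(local, axis=0)

    # Global contextual evidence: pool all turns of all modalities.
    global_evidence = np.mean(
        [c.mean(axis=0) for c in context.values()], axis=0
    )

    # Assumed combination rule: a plain equal-weight average.
    return 0.5 * (local_evidence + global_evidence)


def cue_guided_fusion(utterance, cue):
    """Fuse utterance modalities, gating each one by its agreement with the cue."""
    fused = np.zeros_like(cue, dtype=float)
    for u in utterance.values():
        gate = 1.0 / (1.0 + np.exp(-(u @ cue)))  # scalar sigmoid gate
        fused += gate * u
    return fused / len(utterance)


# Toy usage with random stand-ins for encoded features.
rng = np.random.default_rng(0)
modalities = ("text", "audio", "vision")
context = {m: rng.normal(size=(5, 8)) for m in modalities}   # 5 prior turns
utterance = {m: rng.normal(size=8) for m in modalities}      # current turn

cue = interpretation_cue(context, utterance)
fused = cue_guided_fusion(utterance, cue)
print(cue.shape, fused.shape)
```

In a trained model the fused vector would feed a classifier head, and the attention, pooling, and gate would carry learned parameters; the sketch only shows where the cue enters the pipeline relative to the separately encoded context and utterance.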
