Multimodal Conversation Structure Understanding

Kent K. Chang; Mackenzie Hanh Cramer; Anna Ho; Ti Ti Nguyen; Yilin Yuan; David Bamman

arXiv:2505.17536·cs.CL·January 29, 2026

TL;DR

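This paper introduces a new dataset and a suite of tasks for understanding the structure of multimodal conversations. It shows that models struggle to parse conversational roles when character identities are anonymized, and it reports sociolinguistic patterns in how those roles are distributed among characters.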

Contribution

It presents TV-MMPC, a new annotated dataset, and evaluates multimodal LLMs on conversation structure understanding, showing that performance drops substantially when character identities are anonymized. It also carries out a sociolinguistic analysis of 350,842 utterances in TVQA.

Findings

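01. All multimodal LLMs evaluated outperform the heuristic baseline, but even the best-performing model drops substantially when character identities are anonymized.

02. Female characters are 1.2 times more likely than men to be cast as an addressee or side-participant.

03. The presence of side-participants shifts conversation from personal to social.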

Abstract

While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants…
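The "1.2 times more likely" figure is a ratio of rates between two groups. The sketch below shows one way such a ratio could be computed from role annotations. It is an illustration only: the toy records, the function names, and the choice of normalization (share of a group's role assignments that are non-speaking) are assumptions, and the paper's actual procedure may differ.

```python
# Hypothetical toy annotations: (character gender, role held in one utterance).
# The role names follow the abstract; the records themselves are invented.
annotations = [
    ("F", "speaker"), ("F", "addressee"), ("F", "side-participant"),
    ("F", "addressee"), ("M", "speaker"), ("M", "speaker"),
    ("M", "addressee"), ("M", "speaker"),
]

NON_SPEAKING = {"addressee", "side-participant"}


def non_speaking_rate(records, gender):
    """Share of a group's role assignments that are addressee or side-participant."""
    roles = [role for g, role in records if g == gender]
    if not roles:
        raise ValueError(f"no records for group {gender!r}")
    return sum(role in NON_SPEAKING for role in roles) / len(roles)


def rate_ratio(records, group_a="F", group_b="M"):
    """How many times more likely group_a is than group_b to hold a non-speaking role."""
    # Assumed normalization: ratio of the two per-group shares.
    return non_speaking_rate(records, group_a) / non_speaking_rate(records, group_b)


if __name__ == "__main__":
    print(f"rate ratio (F vs. M): {rate_ratio(annotations):.2f}")
```

On real data, the same ratio would be computed over all annotated utterances rather than a handful of records, and the denominator would need to match whatever exposure measure the analysis uses (for example, speaking time).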

Taxonomy

Topics: Language, Metaphor, and Cognition