Multimodal Conversation Structure Understanding
Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman

TL;DR
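This paper introduces a suite of tasks and a new annotated dataset, TV-MMPC, for understanding the structure of multimodal conversations. It shows that multimodal LLMs lose substantial performance when character identities are anonymized, and it reports sociolinguistic patterns in how characters are cast in conversational roles.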
Contribution
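It presents TV-MMPC, a new annotated dataset, and evaluates multimodal LLMs on conversation structure understanding (conversational roles and threading). The evaluation shows a substantial performance drop when character identities are anonymized, and a separate sociolinguistic analysis covers 350,842 utterances in TVQA.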
Findings
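All evaluated multimodal LLMs outperform the heuristic baseline, but even the best-performing model degrades substantially when character identities are anonymized.
Female characters initiate conversations in proportion to their speaking time, yet they are 1.2 times more likely than men to be cast as an addressee or side-participant.
The presence of side-participants shifts conversation from personal to social.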
Abstract
While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants…
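A minimal sketch of how a "times more likely" figure for listener roles could be computed from role-annotated utterances. Everything here is an assumption made for illustration: the `Utterance` fields, the `listener_rate_ratio` helper, the choice to normalize listener-role counts by each group's number of speaking turns, and the toy data are hypothetical and are not taken from TV-MMPC or from the paper's method.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Utterance:
    # Hypothetical schema: one speaker plus the two listener roles
    # named in the abstract (addressee and side-participant).
    speaker: str
    addressees: list[str] = field(default_factory=list)
    side_participants: list[str] = field(default_factory=list)


def listener_rate_ratio(utterances, gender_of, group_a="F", group_b="M"):
    """Ratio of listener-role rates between two groups.

    Each group's rate is (times its members are an addressee or
    side-participant) / (times its members speak). This normalization
    is an assumption of this sketch, not the paper's definition.
    """
    listening = Counter()
    speaking = Counter()
    for u in utterances:
        speaking[gender_of[u.speaker]] += 1
        for name in u.addressees + u.side_participants:
            listening[gender_of[name]] += 1
    rate_a = listening[group_a] / speaking[group_a]
    rate_b = listening[group_b] / speaking[group_b]
    return rate_a / rate_b


# Toy data, invented for illustration only.
gender_of = {"A": "F", "B": "M", "C": "M"}
utterances = [
    Utterance("A", addressees=["B"], side_participants=["C"]),
    Utterance("B", addressees=["A"]),
    Utterance("C", addressees=["A"], side_participants=["B"]),
    Utterance("A", addressees=["C"]),
]
ratio = listener_rate_ratio(utterances, gender_of)
```

A ratio above 1 would mean group A is cast in listener roles more often than group B relative to how often each group speaks; other normalizations (for example, by screen time) would give different numbers.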
Taxonomy
Topics: Language, Metaphor, and Cognition
