MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine

TL;DR
This paper introduces MTP, a multi-modal dataset for identifying turning points in conversations, and presents TPMaven, a framework that effectively detects and classifies these critical moments using vision-language models.
Contribution
The work provides a new dataset with precise annotations of conversational turning points and a novel framework leveraging advanced models for detection and explanation.
Findings
TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection.
The dataset includes high-consensus, multi-modal annotations.
Explanations generated align well with human judgments.
Abstract
Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations…
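For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the standard F1 definition (the harmonic mean of precision and recall). It is not the paper's evaluation protocol: the function name `f1_score` and the counts used in the example are hypothetical, and how predicted turning points are matched to annotated ones for the detection task is not described in this summary.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are zero (or undefined),
        # so F1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only for illustration (not taken from the paper).
print(round(f1_score(tp=8, fp=2, fn=2), 2))
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which makes it a stricter summary than accuracy when turning points are rare.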
Taxonomy
Topics: Natural Language Processing Techniques
