TL;DR
MISID is a new multimodal, multi-turn dataset from strategic games designed to challenge and improve intent recognition models in complex, deceptive, multi-participant interactions.
Contribution
The paper introduces MISID, a comprehensive dataset with detailed annotations for long-context intent recognition, and proposes FRACTAM, a novel framework that enhances model performance in complex scenarios.
Findings
State-of-the-art Multimodal Large Language Models (MLLMs) show critical deficiencies in complex intent recognition scenarios.
FRACTAM improves intent detection accuracy in multi-turn, multimodal scenarios.
The dataset reveals critical challenges like visual hallucination and limited causal chaining in current models.
Abstract
Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios,…
