Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang

TL;DR
HIER combines a hierarchical semantic reasoning framework with evolutionary feedback for multimodal intent recognition, improving on state-of-the-art methods by 1-3% across benchmarks by modeling complex intent structures and enabling self-adaptive reasoning.
Contribution
The paper presents HIER, a novel hierarchical semantic representation and evolutionary reasoning approach that enhances multimodal intent recognition with structured, self-evolving reasoning capabilities.
Findings
Outperforms state-of-the-art methods by 1-3% across benchmarks.
Effectively models hierarchical semantics for complex intents.
Enables dynamic self-evolution of semantic representations.
Abstract
Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens capturing localized semantic cues, which are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations…
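The abstract's second step (clustering modality-specific tokens into mid-level concepts under label guidance) can be illustrated with a minimal sketch. The paper's actual clustering procedure is not given in the text above, so everything here is an assumption made for illustration: the function name `label_guided_concepts`, the use of cosine similarity to label embeddings as the "guidance", and mean pooling to form each concept vector.

```python
import numpy as np

def label_guided_concepts(tokens, label_embeddings):
    """Hypothetical sketch: group token embeddings by most similar label.

    tokens: (n_tokens, dim) array of modality-specific token embeddings.
    label_embeddings: (n_labels, dim) array, one embedding per intent label.
    Returns (assignments, concepts), where concepts[k] is the mean of the
    tokens assigned to label k (all zeros if no token was assigned).
    """
    tokens = np.asarray(tokens, dtype=float)
    label_embeddings = np.asarray(label_embeddings, dtype=float)

    # L2-normalise both sides so the dot product is cosine similarity.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    l = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    sim = t @ l.T  # shape: (n_tokens, n_labels)

    # Assumed guidance rule: each token joins the cluster of its nearest label.
    assignments = sim.argmax(axis=1)

    # Assumed pooling rule: a concept is the mean of its member tokens.
    concepts = np.zeros_like(label_embeddings)
    for k in range(label_embeddings.shape[0]):
        members = tokens[assignments == k]
        if len(members) > 0:
            concepts[k] = members.mean(axis=0)
    return assignments, concepts
```

Called with a `(n_tokens, dim)` token matrix and a `(n_labels, dim)` label matrix, this returns one cluster index per token and one concept vector per label. Hard nearest-label assignment is only the simplest choice; a soft or learned assignment would fit the same description.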
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems
