Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Qianrui Zhou; Hua Xu; Yunjin Gu; Yifan Wang; Songze Li; Hanlei Zhang

arXiv:2603.03827 · cs.MM · March 5, 2026

TL;DR

HIER introduces a hierarchical semantic reasoning framework combined with evolutionary feedback for multimodal intent recognition, significantly improving accuracy by modeling complex intent structures and enabling self-adaptive reasoning.

Contribution

The paper presents HIER, a novel hierarchical semantic representation and evolutionary reasoning approach that enhances multimodal intent recognition with structured, self-evolving reasoning capabilities.

Findings

1. Outperforms state-of-the-art methods by 1-3% across benchmarks.
2. Effectively models hierarchical semantics for complex intents.
3. Enables dynamic self-evolution of semantic representations.

Abstract

Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens capturing localized semantic cues, which are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations…
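The step from modality-specific tokens to mid-level concepts can be illustrated with a small sketch. The abstract does not say how the label-guided clustering is computed, so everything below is an assumption made for illustration: each token embedding is assigned to the intent-label prototype it is most similar to (cosine similarity), and each non-empty group is mean-pooled into one concept vector. The function names, the label names, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_guided_clusters(tokens, label_prototypes):
    """Group token embeddings by nearest label prototype, then mean-pool each group.

    ASSUMPTION: nearest-prototype assignment plus mean pooling stands in for
    the paper's label-guided strategy, which the abstract does not specify.
    """
    groups = {name: [] for name in label_prototypes}
    for tok in tokens:
        # Pick the label whose prototype is most similar to this token.
        best = max(label_prototypes, key=lambda name: cosine(tok, label_prototypes[name]))
        groups[best].append(tok)

    concepts = {}
    for name, members in groups.items():
        if members:  # labels that attract no tokens produce no concept
            dim = len(members[0])
            concepts[name] = [sum(m[i] for m in members) / len(members) for i in range(dim)]
    return concepts


# Toy 2-D embeddings; real token features would come from an MLLM encoder.
tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
prototypes = {"complain": [1.0, 0.0], "praise": [0.0, 1.0]}
concepts = label_guided_clusters(tokens, prototypes)
print(concepts)
```

In this toy run the first two tokens fall under the hypothetical "complain" prototype and the third under "praise", giving one pooled concept vector per label. A learned or soft assignment could replace the hard nearest-prototype rule without changing the overall shape of the step.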

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems