M2I2HA: Multi-modal Object Detection Based on Intra- and Inter-Modal Hypergraph Attention

Xiaofan Yang; Yubin Liu; Wei Pan; Guoqing Chu; Junming Zhang; Jie Zhao; Zhuoqi Man; Xuanming Cao

arXiv:2601.14776 · cs.CV · January 27, 2026

Open Access

TL;DR

This paper introduces M2I2HA, a hypergraph-based multi-modal object detection network that models high-order relationships within and across modalities and reports state-of-the-art results in challenging environments such as low light and overexposure.

Contribution

The paper proposes a novel hypergraph-based architecture with intra- and inter-modal modules for enhanced multi-modal feature extraction and fusion in object detection.

Findings

01. Achieves state-of-the-art detection accuracy on multiple datasets.

02. Effectively models high-order relationships within each modality.

03. Enhances cross-modal feature alignment and fusion.
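
To make "high-order relationships" concrete: an ordinary graph edge links exactly two nodes, whereas a hyperedge can link any number of nodes at once. The sketch below is a generic two-step mean aggregation over a hypergraph (node to hyperedge, then hyperedge to node). It is not the M2I2HA module; the function name `hypergraph_mean_pass`, the toy features, and the hyperedge sets are all assumptions chosen for illustration.

```python
def hypergraph_mean_pass(features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation (illustrative only).

    features:   list of per-node feature vectors (lists of floats)
    hyperedges: list of sets of node indices; a set may group any number of nodes
    """
    dim = len(features[0])

    # Step 1: each hyperedge summarises its members as the mean of their features.
    edge_feats = []
    for edge in hyperedges:
        members = [features[i] for i in edge]
        edge_feats.append(
            [sum(v[d] for v in members) / len(members) for d in range(dim)]
        )

    # Step 2: each node averages the summaries of the hyperedges it belongs to.
    updated = []
    for i, feat in enumerate(features):
        incident = [edge_feats[k] for k, edge in enumerate(hyperedges) if i in edge]
        if not incident:
            updated.append(list(feat))  # a node in no hyperedge keeps its feature
        else:
            updated.append(
                [sum(e[d] for e in incident) / len(incident) for d in range(dim)]
            )
    return updated


# Hypothetical toy data: four nodes with 1-D features, two hyperedges.
features = [[1.0], [2.0], [3.0], [10.0]]
hyperedges = [{0, 1, 2}, {2, 3}]  # the first hyperedge groups three nodes at once
updated = hypergraph_mean_pass(features, hyperedges)
```

Here node 2 belongs to both hyperedges, so its update mixes information from all four nodes in a single pass. A pairwise graph would need one edge per pair of nodes to express the three-node group.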

Abstract

Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which…

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Multimodal Machine Learning Applications