A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension

Mohammad Zia Ur Rehman; Devraj Raghuvanshi; Umang Jain; Shubhi Bansal; Nagendra Kumar

arXiv:2508.16300 · cs.CV · August 25, 2025

TL;DR

This paper introduces MM-ORIENT, a novel multimodal-multitask framework that reduces noise effects and enhances discriminative feature learning through cross-modal relation graphs and hierarchical interactive attention, improving multimodal comprehension.

Contribution

The paper proposes a framework that combines cross-modal relation graphs with hierarchical interactive attention, so that multimodal representations are less affected by noise and retain the discriminative information within each modality.

Findings

1. Effective in multiple multimodal tasks
2. Reduces noise impact at the latent stage
3. Outperforms existing methods on three datasets

Abstract

A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal…
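The abstract is cut off before it describes how the cross-modal relation graphs work, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes that a neighbourhood graph is built from similarities in one modality (text here) and that each sample's feature in the other modality (image here) is then rebuilt from its graph neighbours, so the two feature vectors of a single sample are never fused directly. The function names, the cosine-similarity k-nearest-neighbour graph, the mean aggregation, and the toy data are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(features, k):
    """For each sample, list the indices of its k most similar *other* samples."""
    graph = []
    for i, f in enumerate(features):
        scored = [(cosine(f, g), j) for j, g in enumerate(features) if j != i]
        # Highest similarity first; ties broken by lower index for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        graph.append([j for _, j in scored[:k]])
    return graph


def reconstruct(target_features, graph):
    """Rebuild each sample's target-modality vector as the mean of its neighbours' vectors.

    Assumes every sample has at least one neighbour (k >= 1 and more than one sample).
    """
    dim = len(target_features[0])
    return [
        [sum(target_features[j][d] for j in neighbours) / len(neighbours) for d in range(dim)]
        for neighbours in graph
    ]


# Hypothetical toy features: four samples, two dimensions per modality.
text_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
image_feats = [[2.0, 0.0], [4.0, 0.0], [0.0, 6.0], [0.0, 8.0]]

# Graph structure comes from the text modality only.
text_graph = knn_graph(text_feats, k=1)

# Image features are rebuilt over the text-derived graph.
image_via_text = reconstruct(image_feats, text_graph)

print(text_graph)
print(image_via_text)
```

In this toy setup each image vector is replaced by the image vector of the sample whose text is most similar, so the rebuilt feature carries information about text-side relations while a sample's own text and image vectors never interact. The paper's actual graph construction and reconstruction step may differ.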
