Multimodal Classification via Modal-Aware Interactive Enhancement
Qing-Yuan Jiang, Zhouyang Chi, Yang Yang

TL;DR
This paper introduces modal-aware interactive enhancement (MIE), a multimodal learning method that combines sharpness-aware minimization (SAM) with a gradient modification strategy to improve generalization and mitigate modality imbalance.
Contribution
The paper proposes MIE, which strengthens the interaction between modalities through SAM-based optimization and a gradient modification strategy, improving performance and alleviating modality forgetting.
Findings
Outperforms state-of-the-art baselines on multiple widely used datasets, achieving the best performance in the reported experiments.
Improves generalization and reduces modality imbalance.
Abstract
Due to the notorious modality imbalance problem, multimodal learning (MML) leads to the phenomenon of optimization imbalance, thus struggling to achieve satisfactory performance. Recently, some representative methods have been proposed to boost the performance, mainly focusing on adaptively adjusting the optimization of each modality to rebalance the learning speed of dominant and non-dominant modalities. To better facilitate the interaction of model information in multimodal learning, in this paper, we propose a novel multimodal learning method, called modal-aware interactive enhancement (MIE). Specifically, we first utilize an optimization strategy based on sharpness-aware minimization (SAM) to smooth the learning objective during the forward phase. Then, with the help of the geometric property of SAM, we propose a gradient modification strategy to impose the influence between different…
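As a rough illustration of the SAM building block the abstract mentions, here is a minimal sketch of one generic SAM update on a toy quadratic loss. It is not the paper's MIE method: it omits the modal-aware parts and the gradient modification strategy, and the names `sam_step`, `toy_loss`, `toy_grad`, `CURVATURE`, and the values of `lr` and `rho` are assumptions made for this example only.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update (illustrative sketch, not the paper's MIE).

    1. Compute the gradient at the current weights.
    2. Move to the approximate worst-case point within an L2 ball of radius rho.
    3. Update the original weights with the gradient taken at that point.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction is defined, so leave w unchanged.
        return list(w)
    # Ascent step: perturb the weights along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Hypothetical toy objective: a quadratic bowl with one sharp and one flat axis.
CURVATURE = [10.0, 1.0]

def toy_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURE, w))

def toy_grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]

def run(steps=50):
    w = [1.0, 1.0]
    for _ in range(steps):
        w = sam_step(w, toy_grad)
    return w
```

Calling `run()` drives the toy loss well below its starting value. Because the gradient is evaluated at the perturbed point rather than at `w` itself, the update favors regions where the loss stays low in a neighborhood, which is the smoothing effect the abstract refers to.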
Taxonomy
Topics: Speech and dialogue systems · Video Analysis and Summarization · Text and Document Classification Technologies
