Deep Equilibrium Multimodal Fusion
Jinhong Ni, Yalong Bai, Wei Zhang, Ting Yao, Tao Mei

TL;DR
This paper introduces a deep equilibrium (DEQ) approach for multimodal fusion that models complex interactions across modalities adaptively, achieving state-of-the-art results on multiple benchmarks.
Contribution
It proposes a novel DEQ-based fusion method that captures dynamic, recursive feature correlations across modalities, improving over fixed strategies.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Effectively models intra- and inter-modality correlations.
Demonstrates versatility across various multimodal tasks.
Abstract
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently. Most existing fusion approaches either learn a fixed fusion strategy during training and inference, or are only capable of fusing the information to a certain extent. Such solutions may fail to fully capture the dynamics of interactions across modalities especially when there are complex intra- and inter-modality correlations to be considered for informative multimodal fusion. In this paper, we propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process and modeling the feature correlations in an adaptive and recursive manner. This new way encodes the rich information within and across modalities thoroughly from low level to high level for efficacious downstream multimodal learning…
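To make the idea of "seeking a fixed point of the fusion process" concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: the toy layer `fuse_step`, the weights `W`, `U_A`, `U_B`, and the inputs `X_A`, `X_B` are all hypothetical, chosen so that the layer is a contraction and naive iteration converges. It only illustrates the generic DEQ condition z* = fθ(z*, x), where x stands for the unimodal features and z* for the fused state.

```python
import math

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def fuse_step(z, x_a, x_b, W, U_a, U_b):
    """One application of a toy fusion layer: tanh(W z + U_a x_a + U_b x_b)."""
    pre = [r + a + b for r, a, b in
           zip(matvec(W, z), matvec(U_a, x_a), matvec(U_b, x_b))]
    return [math.tanh(p) for p in pre]

def solve_fixed_point(x_a, x_b, W, U_a, U_b, z0=None, tol=1e-10, max_iter=500):
    """Iterate z <- f(z; x_a, x_b) until successive iterates differ by < tol."""
    z = list(z0) if z0 is not None else [0.0] * len(W)
    residual = float("inf")
    steps = 0
    while steps < max_iter and residual >= tol:
        z_next = fuse_step(z, x_a, x_b, W, U_a, U_b)
        residual = max(abs(n - o) for n, o in zip(z_next, z))
        z = z_next
        steps += 1
    return z, steps, residual

# Hypothetical weights. Every row of W has absolute sum <= 0.6 and tanh is
# 1-Lipschitz, so the layer is a contraction and the iteration converges.
W = [[0.2, -0.1, 0.0],
     [0.1, 0.3, -0.2],
     [0.0, 0.2, 0.1]]
U_A = [[0.5, -0.3], [0.1, 0.4], [-0.2, 0.6]]  # projects modality A features
U_B = [[0.3, 0.0], [-0.4, 0.2], [0.1, 0.5]]   # projects modality B features
X_A = [1.0, -0.5]  # made-up features for modality A
X_B = [0.2, 0.8]   # made-up features for modality B

z_star, steps, residual = solve_fixed_point(X_A, X_B, W, U_A, U_B)
# A different starting point should reach the same equilibrium.
z_alt, _, _ = solve_fixed_point(X_A, X_B, W, U_A, U_B, z0=[0.9, -0.9, 0.5])
print("fused state:", z_star, "steps:", steps, "residual:", residual)
```

In this sketch the equilibrium does not depend on the starting point, which is what separates a fixed-point solve from unrolling a weight-tied layer a fixed number of times. Practical DEQ implementations usually replace the naive loop with a root solver such as Broyden's method or Anderson acceleration and differentiate implicitly through the fixed point. Neither is shown here.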
Peer Reviews
Decision · Submitted to ICLR 2024
(1) An interesting paper, the proposed DEQ method for multimodal fusion could be a new perspective in the field. By achieving equilibrium, the model could handle complex interactions between different types of data, potentially leading to better performance in tasks that require a comprehensive understanding of multimodal information. It is also a nice contribution to stability and robustness in the learning process for multimodal data. (2) The experimental results are promising.
(1) DEQ models can be complex and require significant computational resources for training and inference. The search for a fixed point can sometimes lead to difficulties in convergence, especially in dynamically changing environments/contexts. There may be challenges in generalising the fixed-point approach to different types of multimodal data or applications. (2) The paper may lack extensive evaluation against challenging applications, which is crucial to establish its real-world effectiveness
(1) This method innovatively combines multimodal fusion with the DEQ framework to iteratively achieve multi-level multimodal fusion while retaining single-modal information. (2) The experiments prove the effectiveness of the method, and the ablation study is relatively complete. The weight visualization in Figure 3 shows how the model dynamically perceives modality importance for efficacious downstream multimodal learning, which is intuitive.
1. The paper compares its method with a weight-tied baseline to show that the proposed method converges. This is expected: the method optimizes fθ under the constraint z* = fθ(z*, x), whereas no such constraint is imposed on the weight-tied method with a finite number of layers, so the weight-tied method certainly cannot converge. 2. In the original DEQ paper, DEQ is proposed for memory efficiency, and the effect is similar to that of weight-tied, and it would be better if th
+ This study concentrates on the development of an innovative multi-modal fusion method, which endeavors to attain a state of equilibrium among features, markedly distinguishing itself from prior fusion processes. + The explanation in this article clearly and meticulously depicts its fusion architecture.
- The phrase "every level" in the introduction implies a comprehensive integration of cross-modality interactions throughout the multi-modal fusion process. However, given the paper’s focus on fusion of features, which is traditionally associated with late fusion, there seems to be a discrepancy. The paper apparently does not delve into early or middle fusion strategies. To reconcile this, one could interpret "every level" as referring to different stages or aspects within the late fusion process
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Speech and Audio Processing
Methods: Deep Equilibrium Models
