Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
Dexuan Ding, Lei Wang, Liyun Zhu, Tom Gedeon, Piotr Koniusz

TL;DR
This paper introduces a learnable, graph-based feature fusion method for multi-modal data in computer vision, improving the capture of structural relationships and deep interactions across diverse features.
Contribution
It proposes a novel, relationship-centric graph expansion and fusion technique that operates in a lower-dimensional, interpretable space, enhancing multi-modal feature integration.
Findings
Effective in video anomaly detection
Outperforms traditional fusion methods
Robust across multi-modal and multi-domain tasks
Abstract
In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels, e.g., clip, frame, patch, and token. To capture deeper interactions, we expand graphs through iterative…
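The idea of building relationship graphs, expanding them through graph powers, and merging the powers with learnable weights can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine-similarity graph, the function names (`relationship_graph`, `graph_powers`, `fuse_graph_powers`), the `(2, K)` weight layout, and the softmax normalization are all hypothetical choices made for the example. In a real model the weight logits would be trained by gradient descent; here they are fixed inputs.

```python
import numpy as np


def relationship_graph(features):
    """Build a cosine-similarity graph over the rows of `features`.

    features: (n_items, dim) array, e.g. one row per frame or token.
    Returns an (n_items, n_items) symmetric matrix (an assumed graph choice).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def graph_powers(adj, max_power):
    """Return the stack [A^1, ..., A^K]; A^k mixes k-hop relationships."""
    powers = [adj]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ adj)
    return np.stack(powers)  # shape (K, n, n)


def fuse_graph_powers(powers_a, powers_b, weight_logits):
    """Weighted sum of graph powers from two modalities.

    weight_logits: (2, K) array, one row per modality (hypothetical layout).
    The logits are softmax-normalized so the fusion weights sum to 1.
    """
    w = np.exp(weight_logits - weight_logits.max())
    w = w / w.sum()
    stacked = np.stack([powers_a, powers_b])  # shape (2, K, n, n)
    return np.einsum("mk,mkij->ij", w, stacked)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(5, 16))  # 5 items, 16-dim visual features
    text = rng.normal(size=(5, 8))     # same 5 items, 8-dim text features

    pa = graph_powers(relationship_graph(visual), max_power=3)
    pb = graph_powers(relationship_graph(text), max_power=3)
    fused = fuse_graph_powers(pa, pb, np.zeros((2, 3)))
    print(fused.shape)
```

Note that the two modalities may have different feature dimensions (16 and 8 here), yet both graphs are `n_items × n_items`, which is what makes fusion in graph space straightforward in this sketch.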
Peer Reviews
Decision: ICLR 2025 Poster
1. The graph-based fusion approach is innovative as it focuses on relationship-centric fusion, potentially capturing deeper interactions and structural relationships.
2. The use of graph power expansions to model multi-hop connections is a strong point.
3. Introducing a learnable weight matrix to dynamically integrate different graph powers is a key advancement.
4. This paper is easy to follow.
1. While the method is sophisticated, its complexity might hinder interpretability. The relationship between graph power expansions and their real-world applications could be elaborated further for clarity.
2. The scalability of the graph-based approach for large-scale datasets or real-time applications is not fully addressed. More discussion on computational efficiency would strengthen the paper.
3. The integration of the LEGO method with existing machine learning frameworks or pipelines is not
The illustration of the method is clear, and this method is easy to follow. The performance is credible.
Currently, there are many methods based on cross-modal graph neural networks that create their own graph networks for features from different modalities, and then use inter-graph convolution to obtain cross-modal embeddings, facilitating feature propagation and fusion in the process. Compared to these methods, the improvements presented in this paper seem to consist only of weighted sums at different convolution depths. The authors need to explain the advantages of this approach over previous
1. The proposed framework seamlessly combines visual and text features, leveraging their complementary information to enhance anomaly detection.
2. It utilizes graph power expansion and dynamically learns optimal weights for merging different graph powers, allowing the model to prioritize relevant relationships.
3. LEGO requires significantly fewer parameters than traditional methods like MTN fusion, resulting in faster training times.
4. LEGO consistently outperforms baseline and state-of-the-art
(See the questions section below)
Taxonomy
Topics: Rough Sets and Fuzzy Logic · Face and Expression Recognition · Machine Learning and Data Classification
