Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

Dexuan Ding; Lei Wang; Liyun Zhu; Tom Gedeon; Piotr Koniusz

arXiv:2410.01506·cs.CV·March 3, 2025·2 cites

TL;DR

This paper introduces a learnable, graph-based feature fusion method for multi-modal data in computer vision, improving the capture of structural relationships and deep interactions across diverse features.

Contribution

It proposes a novel, relationship-centric graph expansion and fusion technique that operates in a lower-dimensional, interpretable space, enhancing multi-modal feature integration.

Findings

1. Effective in video anomaly detection
2. Outperforms traditional fusion methods
3. Robust across multi-modal and multi-domain tasks

Abstract

In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels (e.g., clip, frame, patch, and token). To capture deeper interactions, we expand graphs through iterative…
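The idea of moving from feature space to a relationship graph can be illustrated with a small sketch. The abstract does not say which similarity measure the authors use, so cosine similarity is an assumption here, and `cosine_similarity_graph` and the toy `frame_features` are hypothetical names and data chosen only for illustration. The point is that the graph has one row and column per item (frame, patch, token), so its size does not depend on the feature dimension.

```python
import math


def cosine_similarity_graph(features):
    """Return an n x n matrix of pairwise cosine similarities.

    Assumption: cosine similarity stands in for whatever relationship
    measure the paper actually uses.
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            denom = norms[i] * norms[j]
            # Guard against zero vectors, which have no direction.
            graph[i][j] = dot / denom if denom > 0 else 0.0
    return graph


# Hypothetical toy input: 3 frames, each with a 4-dimensional feature.
frame_features = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
graph = cosine_similarity_graph(frame_features)
```

With real features of several hundred or thousand dimensions, the same function would still produce a 3 x 3 matrix for three frames, which is what makes the graph space compact and easy to inspect.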

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The graph-based fusion approach is innovative as it focuses on relationship-centric fusion, potentially capturing deeper interactions and structural relationships.
2. The use of graph power expansions to model multi-hop connections is a strong point.
3. Introducing a learnable weight matrix to dynamically integrate different graph powers is a key advancement.
4. This paper is easy to follow.
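The graph power expansion and weighted integration that the reviewer describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses one scalar weight per power with fixed placeholder values, whereas the paper reportedly learns a weight matrix during training. The helper names `matmul`, `graph_powers`, and `combine` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def graph_powers(adj, max_power):
    """Return [A, A^2, ..., A^max_power]; A^k captures k-hop connections."""
    powers = [adj]
    for _ in range(max_power - 1):
        powers.append(matmul(powers[-1], adj))
    return powers


def combine(powers, weights):
    """Weighted sum of graph powers (scalar weights are an assumption)."""
    n = len(powers[0])
    return [
        [sum(w * p[i][j] for w, p in zip(weights, powers)) for j in range(n)]
        for i in range(n)
    ]


# Toy path graph 0 - 1 - 2: nodes 0 and 2 are linked only through node 1.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
powers = graph_powers(adj, 2)
weights = [0.7, 0.3]  # placeholder values; in training these would be learned
fused = combine(powers, weights)
```

In this toy graph, nodes 0 and 2 share no direct edge, yet the squared adjacency matrix has a nonzero entry between them, so the fused graph assigns that pair a weight coming entirely from the 2-hop term.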

Weaknesses

1. While the method is sophisticated, its complexity might hinder interpretability. The relationship between graph power expansions and their real-world applications could be elaborated further for clarity.
2. The scalability of the graph-based approach for large-scale datasets or real-time applications is not fully addressed. More discussion on computational efficiency would strengthen the paper.
3. The integration of the LEGO method with existing machine learning frameworks or pipelines is not

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The illustration of the method is clear, and this method is easy to follow. The performance is credible.

Weaknesses

Currently, there are many methods based on cross-modal graph neural networks that create their own graph networks for features from different modalities, and then use inter-graph convolution to obtain cross-modal embeddings, facilitating feature propagation and fusion in the process. Compared to these methods, the improvements presented in this paper seem to consist only of weighted sums at different convolution depths. The authors need to explain the advantages of this approach over previous

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The proposed framework seamlessly combines visual and text features, leveraging their complementary information to enhance anomaly detection.
2. It utilizes graph power expansion and dynamically learns optimal weights for merging different graph powers, allowing the model to prioritize relevant relationships.
3. LEGO requires significantly fewer parameters than traditional methods like MTN fusion, resulting in faster training times.
4. LEGO consistently outperforms baseline and state-of-the-art

Weaknesses

(See the questions section below)

Videos

Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion · SlidesLive

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Face and Expression Recognition · Machine Learning and Data Classification