Zoom and Shift are All You Need

Jiahao Qin

arXiv:2406.08866·cs.CV·June 14, 2024

TL;DR

This paper introduces a novel feature alignment method that alternates shifting and expanding features across modalities to achieve full integration, leading to improved multimodal learning performance.

Contribution

The proposed approach offers a new technique for multimodal feature fusion that, according to the abstract, outperforms prevalent fusion schemes on a range of tasks over time series, image, and text data.

Findings

01. Achieves state-of-the-art results on multimodal datasets comprising time series, image, and text

02. Reliably captures high-level interplay between features from distinct modalities

03. Outperforms prevalent multimodal fusion schemes on a range of tasks

Abstract

Feature alignment serves as the primary mechanism for fusing multimodal data. We put forth a feature alignment approach that achieves full integration of multimodal information. This is accomplished via an alternating process of shifting and expanding feature representations across modalities to obtain a consistent unified representation in a joint feature space. The proposed technique can reliably capture high-level interplay between features originating from distinct modalities. Consequently, substantial gains in multimodal learning performance are attained. Additionally, we demonstrate the superiority of our approach over other prevalent multimodal fusion schemes on a range of tasks. Extensive experimental evaluation conducted on multimodal datasets comprising time series, image, and text demonstrates that our method achieves state-of-the-art results.
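
The abstract describes the method only at a high level, so the following is a minimal sketch of what an alternating shift-and-expand fusion could look like, not the paper's actual algorithm. Every name here (`expand`, `shift`, `zoom_and_shift_fuse`, the `steps` parameter) is hypothetical, and the specific choices (expansion by element repetition, circular shifts by one position, fusion by averaging) are assumptions made purely for illustration.

```python
from math import gcd


def expand(features, target_len):
    """Expand ("zoom") a vector to target_len by repeating each element."""
    if not features or target_len % len(features) != 0:
        raise ValueError("target_len must be a positive multiple of len(features)")
    factor = target_len // len(features)
    return [x for x in features for _ in range(factor)]


def shift(features, offset):
    """Circularly shift a list to the right by `offset` positions."""
    offset %= len(features)
    if offset == 0:
        return list(features)
    return features[-offset:] + features[:-offset]


def zoom_and_shift_fuse(a, b, steps=2):
    """Hypothetical fusion of two modality feature vectors.

    Assumption: both vectors are expanded to a shared length (their least
    common multiple), then one modality at a time is shifted by one position,
    alternating between the two. The fused vector is the average over all
    aligned (a, b) pairs seen along the way.
    """
    n = len(a) * len(b) // gcd(len(a), len(b))  # shared joint dimension
    a, b = expand(list(a), n), expand(list(b), n)
    total = [x + y for x, y in zip(a, b)]
    for step in range(steps):
        # Alternate which modality is displaced at each step.
        if step % 2 == 0:
            a = shift(a, 1)
        else:
            b = shift(b, 1)
        total = [t + x + y for t, x, y in zip(total, a, b)]
    return [t / (2 * (steps + 1)) for t in total]


# Example: a 2-dimensional and a 3-dimensional feature vector
# are fused in a shared 6-dimensional space.
fused = zoom_and_shift_fuse([1.0, 3.0], [2.0, 4.0, 6.0], steps=2)
```

In this sketch the shared dimension is the least common multiple of the two input lengths, which keeps the repetition-based expansion exact. A learned projection would be the more likely choice in practice, but the abstract does not say which the paper uses.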

Taxonomy

Topics: History and Developments in Astronomy