Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers

Yusuf Shihata

arXiv:2507.02985·cs.CV·July 8, 2025

TL;DR

Gated Recursive Fusion (GRF) is a recurrent multimodal transformer architecture that processes modalities one at a time, staying competitive with exhaustive cross-attention models while scaling linearly rather than quadratically in the number of modalities.

Contribution

The paper proposes GRF, a novel linear-scaling, stateful fusion method that combines cross-modal attention with a gated recurrent mechanism for efficient multimodal learning.

Findings

01. Achieves competitive results on the CMU-MOSI benchmark.

02. Creates structured, class-separable representations.

03. Scales linearly with the number of modalities.
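
To make the linear-scaling finding concrete, the sketch below counts cross-attention calls under a simple assumed cost model; it is not taken from the paper. It assumes exhaustive pairwise fusion runs one call per ordered pair of modalities, and that the recurrent pipeline runs two calls per modality (context attends to modality, modality attends to context). The function names are hypothetical.

```python
def pairwise_attention_calls(num_modalities: int) -> int:
    """Calls when every modality attends to every other one (ordered pairs)."""
    return num_modalities * (num_modalities - 1)


def recurrent_attention_calls(num_modalities: int) -> int:
    """Calls when each modality is fused once into a shared context.

    Assumption: each fusion step is symmetric, so it costs two calls.
    """
    return 2 * num_modalities


# Print the two counts side by side for a few modality counts.
for m in (3, 6, 12):
    print(m, pairwise_attention_calls(m), recurrent_attention_calls(m))
```

Under this model the two schemes cost the same at three modalities, the CMU-MOSI setting, and the gap widens as modalities are added.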

Abstract

Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU), a GRU-inspired mechanism that dynamically arbitrates information flow,…
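
The sequential update described in the abstract can be illustrated with a minimal sketch. It uses a generic GRU-style gate and may differ from the paper's actual GFU equations, which are not given here. The cross-attention step is omitted: each modality is assumed to arrive as an already enriched feature vector. The function names, the weight sharing across steps, and the random initialisation are all assumptions made for illustration.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def gated_update(context, enriched, w_gate, w_cand):
    """GRU-style blend of the old context with a candidate state."""
    joint = context + enriched  # concatenate the two vectors
    gate = [sigmoid(v) for v in matvec(w_gate, joint)]
    cand = [math.tanh(v) for v in matvec(w_cand, joint)]
    # A gate near 0 keeps the old context; near 1 takes the candidate.
    return [(1 - g) * c + g * n for g, c, n in zip(gate, context, cand)]


def fuse_sequentially(modalities, dim, seed=0):
    """Fold modality vectors into one context vector, one step each."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)]
                for _ in range(dim)]

    # Assumption: the same gate weights are reused at every step.
    w_gate, w_cand = rand_matrix(), rand_matrix()
    context = [0.0] * dim  # empty context before any modality is seen
    for features in modalities:
        context = gated_update(context, features, w_gate, w_cand)
    return context


# Three toy modality vectors of width 4 (hypothetical values).
toy = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.5, 0.2, 0.0], [1.0, 0.0, -1.0, 0.5]]
print(fuse_sequentially(toy, dim=4))
```

The loop runs once per modality, so adding a modality adds one fusion step rather than a new set of pairwise interactions.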
