MambaVF: State Space Model for Efficient Video Fusion

Zixiang Zhao; Yukun Cui; Lilun Deng; Haowen Bai; Haotong Qin; Tao Feng; Konrad Schindler

arXiv:2602.06017 · cs.CV · February 6, 2026

TL;DR

MambaVF introduces a state space model-based framework for efficient video fusion that eliminates explicit motion estimation, significantly reducing computational costs while achieving state-of-the-art results across various video fusion tasks.

Contribution

It reformulates video fusion as a sequential state update process using state space models, enabling long-range temporal modeling with linear complexity and high efficiency.
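
For orientation, a generic discrete-time linear state space recurrence is written out below. The symbols ($x_t$ for the input token at step $t$, $h_t$ for the hidden state, $y_t$ for the output, $\bar{A}$, $\bar{B}$, $C$ for the model parameters, and $T$ for the sequence length) are assumptions introduced here for illustration; this summary does not give MambaVF's exact parameterization.

```latex
\begin{aligned}
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t, \\
y_t &= C\, h_t, \qquad t = 1, \dots, T.
\end{aligned}
```

Each step depends only on the previous state $h_{t-1}$ and the current input, so one pass over the sequence costs $O(T)$, which is the sense in which such a sequential state update has linear complexity.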

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Reduces parameters by up to 92.25% and FLOPs by 88.79%.

03 Provides a 2.1x speedup over existing methods.

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple…
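
The idea of aggregating information across frames with a bidirectional scan, rather than with optical flow and warping, can be illustrated with a toy sketch. This is not the authors' module: the scalar state, the fixed coefficients `a`, `b`, `c`, the interleaving of the two source streams, and the function names `ssm_scan`, `bidirectional_scan`, and `fuse_sequences` are all hypothetical choices made only to show a forward and a backward linear-time pass over a token sequence.

```python
from typing import List, Sequence


def ssm_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear state update: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence, so cost is linear in len(xs)
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Sum a forward scan and a backward scan so every token sees both directions."""
    fwd = ssm_scan(xs, a, b, c)
    bwd = ssm_scan(list(reversed(xs)), a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


def fuse_sequences(src_a: Sequence[float], src_b: Sequence[float],
                   a: float = 0.5, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy fusion (assumption, not the paper's design): interleave the two
    per-frame feature streams, scan bidirectionally, then average each
    frame's pair of outputs into one fused value per frame."""
    if len(src_a) != len(src_b):
        raise ValueError("both sources must have the same number of frames")
    tokens = []
    for xa, xb in zip(src_a, src_b):
        tokens.extend([xa, xb])
    ys = bidirectional_scan(tokens, a, b, c)
    return [(ys[2 * t] + ys[2 * t + 1]) / 2.0 for t in range(len(src_a))]


if __name__ == "__main__":
    # An impulse in the first frame of source A decays across later frames.
    print(fuse_sequences([1.0, 0.0], [0.0, 0.0]))
```

In this sketch no motion is estimated at any point: information from earlier and later frames reaches each token only through the carried state, and the total work is two linear passes over the sequence.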

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis