MambaVF: State Space Model for Efficient Video Fusion
Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler

TL;DR
MambaVF introduces a state space model-based framework for efficient video fusion that eliminates explicit motion estimation, significantly reducing computational costs while achieving state-of-the-art results across various video fusion tasks.
Contribution
It reformulates video fusion as a sequential state update process using state space models, enabling long-range temporal modeling with linear complexity and high efficiency.
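To make "sequential state update" concrete, here is a minimal sketch of a discrete linear state space recurrence, h_t = A h_{t-1} + B x_t with output y_t = C h_t, applied to a sequence of per-frame feature vectors. It is not MambaVF's implementation: the function name `ssm_scan`, the fixed matrices `A`, `B`, `C`, and the feature layout are illustrative assumptions, and Mamba-style models make these parameters input-dependent. The point is only that each step touches the previous state once, so cost grows linearly with sequence length.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state space recurrence over a sequence.

    x: (T, d_in) array, one feature vector per time step (assumed layout).
    A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state).
    Returns y with shape (T, d_out).
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])          # hidden state carried across steps
    y = np.empty((T, C.shape[0]))
    for t in range(T):                # one update per step: O(T) overall
        h = A @ h + B @ x[t]          # state update
        y[t] = C @ h                  # readout
    return y

# Toy example: scalar state with decay 0.5 and constant input 1.
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
x = np.ones((3, 1))
y = ssm_scan(x, A, B, C)              # states are 1, 1.5, 1.75
```

The state at step t summarizes all earlier inputs, which is how a recurrence of this kind can carry information across many frames without pairwise comparisons between them.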
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Reduces parameters by up to 92.25% and FLOPs by 88.79%.
Provides a 2.1x speedup over existing methods.
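As a quick arithmetic check on the reported reductions, a percentage reduction of p leaves a fraction 1 - p/100 of the original cost, so the corresponding shrink factor is 1 / (1 - p/100). The helper name `reduction_factor` is illustrative, and the baseline being compared against is not named in this summary.

```python
def reduction_factor(percent_reduction):
    """Convert a percentage reduction into an 'N times smaller' factor."""
    remaining = 1.0 - percent_reduction / 100.0
    return 1.0 / remaining

params_factor = round(reduction_factor(92.25), 1)  # 12.9
flops_factor = round(reduction_factor(88.79), 1)   # 8.9
```

Read this way, the "up to" figures correspond to roughly 12.9 times fewer parameters and 8.9 times fewer FLOPs in the best reported case.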
Abstract
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple…
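The idea of a spatio-temporal bidirectional scan can be sketched as follows, under loudly stated assumptions: the two source videos are concatenated along channels, all frames are flattened into one token sequence in frame-major raster order, a simple elementwise recurrence is run forward and backward, and the two passes are averaged. The function names, the scan order, the scalar coefficients `a` and `b`, and the averaging steps are all hypothetical; the excerpt does not specify the actual module design.

```python
import numpy as np

def linear_scan(tokens, a, b):
    """Elementwise recurrence h_t = a * h_{t-1} + b * x_t; returns all h_t."""
    h = np.zeros(tokens.shape[1])
    out = np.empty_like(tokens, dtype=float)
    for t in range(tokens.shape[0]):
        h = a * h + b * tokens[t]
        out[t] = h
    return out

def bidirectional_fuse(video_a, video_b, a=0.5, b=1.0):
    """Fuse two (T, H, W, C) videos with a forward and a backward scan.

    No optical flow or warping is used: temporal context comes only from
    the recurrent state carried along the flattened token sequence.
    """
    T, H, W, C = video_a.shape
    feats = np.concatenate([video_a, video_b], axis=-1)   # (T, H, W, 2C)
    tokens = feats.reshape(T * H * W, 2 * C)              # frame-major order
    fwd = linear_scan(tokens, a, b)                       # past -> future
    bwd = linear_scan(tokens[::-1], a, b)[::-1]           # future -> past
    merged = 0.5 * (fwd + bwd)
    fused = 0.5 * (merged[:, :C] + merged[:, C:])         # back to C channels
    return fused.reshape(T, H, W, C)

# Toy example: one frame, three pixels, one channel, constant inputs.
va = np.ones((1, 1, 3, 1))
vb = np.ones((1, 1, 3, 1))
out = bidirectional_fuse(va, vb)
```

Because each pass is a single sweep over the tokens, the total cost stays linear in the number of tokens, and the backward pass lets every position see context from later frames as well as earlier ones.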
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
