Deformba: Vision State Space Model with Adaptive State Fusion
Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

TL;DR
Deformba introduces an adaptive vision state space model that dynamically enhances spatial information and supports multi-modal fusion, improving performance on diverse 2D and 3D vision tasks.
Contribution
The paper proposes a context-adaptive method for vision SSMs that maintains linear complexity while enabling flexible spatial augmentation and multi-modal fusion.
Findings
Achieves strong performance on image classification, detection, and segmentation.
Demonstrates effectiveness on 3D perception tasks such as bird's-eye-view (BEV) perception.
Maintains linear complexity while enhancing spatial and multi-modal capabilities.
Abstract
State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that…
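To make the first limitation concrete, here is a minimal NumPy sketch of the fixed-scan baseline the abstract describes, not of Deformba itself. It flattens a grid of patch embeddings in two hand-chosen orders and runs the same causal linear recurrence over each. The function names, the per-channel parameters `a`, `b`, `c`, and the toy 2x3 grid are all assumptions made for illustration; real vision SSMs use learned, input-dependent parameters and more elaborate scan patterns.

```python
import numpy as np

def raster_scan(patches):
    """Flatten an (H, W, D) grid of patch embeddings row by row into (H*W, D)."""
    h, w, d = patches.shape
    return patches.reshape(h * w, d)

def column_scan(patches):
    """Flatten column by column: a second fixed, hand-chosen order."""
    h, w, d = patches.shape
    return patches.transpose(1, 0, 2).reshape(h * w, d)

def diagonal_ssm(x, a, b, c):
    """Causal recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    x: (L, D) sequence; a, b, c: (D,) per-channel parameters (hypothetical).
    A single pass over the sequence, so the cost grows linearly with L.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

# Toy input: a 2x3 grid of patches with one channel, values 0..5.
grid = np.arange(6, dtype=float).reshape(2, 3, 1)
a, b, c = np.array([0.5]), np.array([1.0]), np.array([1.0])

row_seq = raster_scan(grid)   # visits patches in the order 0, 1, 2, 3, 4, 5
col_seq = column_scan(grid)   # visits patches in the order 0, 3, 1, 4, 2, 5

y_row = diagonal_ssm(row_seq, a, b, c)
y_col = diagonal_ssm(col_seq, a, b, c)

# Causality check: perturbing the last token leaves earlier outputs unchanged.
perturbed = row_seq.copy()
perturbed[-1] += 10.0
y_perturbed = diagonal_ssm(perturbed, a, b, c)
```

The two scan orders give different outputs for the same image, which is the sense in which a fixed scan imposes a predefined geometric structure. The causality check shows why a plain 1D SSM is self-referential: each output depends only on earlier tokens of its own sequence, so there is no built-in way for a separate query stream to attend to it.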
