Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
Fady Ibrahim, Guangjun Liu, Guanghui Wang

TL;DR
This paper systematically compares six discretization schemes within Vision Mamba and finds that the bilinear (Tustin) transform (BIL) offers the best balance of accuracy and efficiency for SSM-based vision models.
Contribution
An empirical evaluation of six discretization methods in Vision Mamba, with BIL recommended as the default in place of zero-order hold (ZOH).
Findings
Polynomial interpolation (POL) and higher-order hold (HOH) yield the largest accuracy gains but increase training time.
BIL offers consistent improvements with modest overhead.
Discretization choice significantly impacts SSM-based vision model performance.
Abstract
Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time…
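To make the ZOH versus BIL distinction concrete, here is a minimal NumPy sketch of the two textbook discretization rules for a diagonal continuous-time SSM x'(t) = a x(t) + b u(t) with step size delta. It is an illustration under stated assumptions, not the paper's implementation: the function names (`discretize_zoh`, `discretize_bilinear`, `run_ssm`) and the toy parameters are hypothetical, and Vision Mamba's input-dependent step sizes and selective scan are omitted.

```python
import numpy as np

def discretize_zoh(a, b, delta):
    """Zero-order hold: treats the input as constant over each step.
    Diagonal case, so every operation is elementwise. Requires a != 0."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def discretize_bilinear(a, b, delta):
    """Bilinear (Tustin) transform for the same diagonal system."""
    half = 0.5 * delta * a
    a_bar = (1.0 + half) / (1.0 - half)
    b_bar = delta * b / (1.0 - half)
    return a_bar, b_bar

def run_ssm(a_bar, b_bar, u):
    """Run the recurrence x_k = a_bar * x_{k-1} + b_bar * u_k from x_0 = 0."""
    x = np.zeros_like(a_bar)
    states = []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        states.append(x)
    return np.stack(states)  # shape: (len(u), state_dim)

# Toy parameters (assumed for illustration only).
a = np.array([-1.0, -2.0])   # stable diagonal state matrix
b = np.array([1.0, 1.0])
delta = 0.1
u = np.ones(50)              # unit step input

x_zoh = run_ssm(*discretize_zoh(a, b, delta), u)
x_bil = run_ssm(*discretize_bilinear(a, b, delta), u)
print("max |ZOH - BIL| over the step response:", np.abs(x_zoh - x_bil).max())
```

For a piecewise-constant input such as this unit step, ZOH reproduces the continuous-time response exactly at the sampling instants, so the printed gap is the bilinear rule's approximation error for that input. When the input varies within a step, the constant-input assumption behind ZOH no longer holds, which is the regime the abstract is concerned with.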
