Scalable Visual State Space Model with Fractal Scanning
Lv Tang, HaoKe Xiao, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li

TL;DR
This paper introduces a fractal scanning approach for serializing image patches in State Space Models, significantly improving their ability to model complex spatial patterns and outperforming existing methods in vision tasks.
Contribution
The paper proposes using fractal scanning curves for patch serialization in SSMs, addressing limitations of linear methods and enhancing performance in vision tasks.
Findings
Fractal scanning improves spatial relationship modeling.
Enhanced performance in image classification, detection, and segmentation.
Outperforms existing serialization methods in SSMs.
Abstract
Foundational models have significantly advanced in natural language processing (NLP) and computer vision (CV), with the Transformer architecture becoming a standard backbone. However, the Transformer's quadratic complexity poses challenges for handling longer sequences and higher resolution images. To address this challenge, State Space Models (SSMs) like Mamba have emerged as efficient alternatives, initially matching Transformer performance in NLP tasks and later surpassing Vision Transformers (ViTs) in various CV tasks. To improve the performance of SSMs, one crucial aspect is effective serialization of image patches. Existing methods, relying on linear scanning curves, often fail to capture complex spatial relationships and produce repetitive patterns, leading to biases. To address these limitations, we propose using fractal scanning curves for patch serialization. Fractal curves…
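To make the serialization idea concrete, here is a minimal sketch contrasting a row-major (raster) scan with a fractal scan over a square patch grid. The truncated abstract does not say which fractal curve the authors use, so the Hilbert curve below is an assumption, chosen only because it is a well-known space-filling curve. All function names are hypothetical and not taken from the paper's code.

```python
# Sketch only: a Hilbert curve stands in for "a fractal scanning curve".
# The paper's actual curve and implementation may differ.

def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves join end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_permutation(order):
    """Row-major patch indices listed in Hilbert-curve visiting order."""
    n = 1 << order
    perm = []
    for d in range(n * n):
        x, y = hilbert_d2xy(order, d)
        perm.append(y * n + x)
    return perm


def raster_permutation(order):
    """Row-major patch indices in plain raster (line-by-line) order."""
    n = 1 << order
    return list(range(n * n))


def max_step(perm, order):
    """Largest Manhattan distance on the grid between consecutive patches."""
    n = 1 << order
    worst = 0
    for a, b in zip(perm, perm[1:]):
        ya, xa = divmod(a, n)
        yb, xb = divmod(b, n)
        worst = max(worst, abs(ya - yb) + abs(xa - xb))
    return worst


def serialize(patches, perm):
    """Reorder row-major patches into the 1D sequence fed to the SSM."""
    return [patches[i] for i in perm]


def deserialize(sequence, perm):
    """Invert serialize: put sequence items back in row-major order."""
    out = [None] * len(perm)
    for pos, i in enumerate(perm):
        out[i] = sequence[pos]
    return out
```

On an 8×8 grid (`order=3`), `max_step(raster_permutation(3), 3)` is 8 because the scan jumps from the end of one row to the start of the next, while `max_step(hilbert_permutation(3), 3)` is 1: every pair of consecutive tokens comes from adjacent patches. This locality is the property the abstract attributes to fractal curves in general.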
Taxonomy
Topics: Image Retrieval and Classification Techniques · Currency Recognition and Detection · Image Processing and 3D Reconstruction
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout
