Scalable Visual State Space Model with Fractal Scanning

Lv Tang; HaoKe Xiao; Peng-Tao Jiang; Hao Zhang; Jinwei Chen; Bo Li

arXiv:2405.14480·cs.CV·May 28, 2024·5 cites

TL;DR

This paper introduces a fractal scanning approach for serializing image patches in State Space Models, significantly improving their ability to model complex spatial patterns and outperforming existing methods in vision tasks.

Contribution

The paper proposes using fractal scanning curves for patch serialization in SSMs, addressing limitations of linear methods and enhancing performance in vision tasks.
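To make the idea of fractal patch serialization concrete, here is a minimal sketch in Python. It assumes a Hilbert curve as the fractal scanning curve, which is one well-known space-filling curve; the summary above does not say which curve the paper uses, so this choice is an assumption. It also assumes a square patch grid whose side is a power of two. The function names `hilbert_d2xy`, `hilbert_order`, `row_major_order`, and `serialize` are hypothetical and are not taken from the paper's code.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. This is the standard iterative
    distance-to-coordinate conversion for the Hilbert curve.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return flat patch indices (row * n + col) in Hilbert-curve order."""
    order = []
    for d in range(n * n):
        col, row = hilbert_d2xy(n, d)
        order.append(row * n + col)
    return order


def row_major_order(n):
    """Baseline linear scan: left to right, top to bottom."""
    return list(range(n * n))


def serialize(patches, order):
    """Reorder a flat, row-major list of patches into scan order."""
    return [patches[i] for i in order]


def max_step(order, n):
    """Largest Manhattan distance between consecutive patches in a scan."""
    coords = [(i // n, i % n) for i in order]
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(coords, coords[1:])
    )


if __name__ == "__main__":
    n = 8  # hypothetical 8 x 8 patch grid
    print("row-major max step:", max_step(row_major_order(n), n))
    print("hilbert   max step:", max_step(hilbert_order(n), n))
```

In this sketch, consecutive patches in the Hilbert ordering are always grid neighbors, whereas a row-major scan jumps across the full image width at the end of every row. That locality property is one plausible reading of why a fractal scan could preserve spatial relationships better than a linear scan.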

Findings

01. Fractal scanning improves spatial relationship modeling.

02. Enhanced performance in image classification, detection, and segmentation.

03. Outperforms existing serialization methods in SSMs.

Abstract

Foundation models have advanced significantly in natural language processing (NLP) and computer vision (CV), with the Transformer architecture becoming a standard backbone. However, the Transformer's quadratic complexity makes longer sequences and higher-resolution images difficult to handle. To address this challenge, State Space Models (SSMs) such as Mamba have emerged as efficient alternatives, initially matching Transformer performance in NLP tasks and later surpassing Vision Transformers (ViTs) in various CV tasks. One crucial factor in improving SSM performance is the effective serialization of image patches. Existing methods, which rely on linear scanning curves, often fail to capture complex spatial relationships and produce repetitive patterns, leading to biases. To address these limitations, we propose using fractal scanning curves for patch serialization. Fractal curves…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Currency Recognition and Detection · Image Processing and 3D Reconstruction

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout