Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
Juntao Zhang, Shaogeng Liu, Jun Zhou, Kun Bian, You Zhou, Jianning Liu, Pei Zhang, Bingyan Liu

TL;DR
Vim-F is a visual state space model that combines frequency-domain and spatial-domain processing. It uses the FFT alongside pure Mamba encoders to widen the global receptive field and better capture local correlations for image understanding.
Contribution
The paper proposes Vim-F, a new visual model that integrates frequency domain features with spatial features, removing position embedding and redesigning patch embedding for better performance.
Findings
Vim-F outperforms prior Vision Mamba (ViM) models on visual tasks.
Frequency domain integration enhances the global receptive field of the model.
Removing position embedding does not harm, and may improve, model effectiveness.
Abstract
In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences, for example in language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use the Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation…
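The flattening and spectrum-addition steps described in the abstract can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `flatten_row_major` and `add_spectrum` are hypothetical, and taking the magnitude of the complex spectrum before the addition is an assumption, since the excerpt does not say how the spectrum is made real-valued.

```python
import numpy as np


def flatten_row_major(feature_map: np.ndarray) -> np.ndarray:
    """Flatten a (C, H, W) feature map into a (H*W, C) token sequence.

    Hypothetical helper: row-major scanning is one common way to turn
    a 2D image into the 1D sequence an SSM consumes.
    """
    channels, height, width = feature_map.shape
    return feature_map.reshape(channels, height * width).T


def add_spectrum(feature_map: np.ndarray) -> np.ndarray:
    """Add the 2D FFT spectrum of a (C, H, W) feature map to the map itself.

    Assumption: the complex spectrum is reduced to its magnitude so the
    sum stays real-valued; the source excerpt does not specify this step.
    """
    # 2D FFT over the spatial axes, computed independently per channel.
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1), norm="ortho")
    amplitude = np.abs(spectrum)
    # Element-wise sum of spatial features and frequency-domain features.
    return feature_map + amplitude


if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal((3, 8, 8))
    fused = add_spectrum(x)
    tokens = flatten_row_major(fused)
    print(fused.shape, tokens.shape)
```

In a row-major flattening, horizontally adjacent pixels stay adjacent in the sequence, while vertically adjacent pixels end up W positions apart; this is the loss of 2D locality the abstract refers to. Each FFT coefficient, by contrast, depends on every spatial position, which is why adding the spectrum can supply global context.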
Taxonomy
Topics: Neural Networks and Applications
