Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

Juntao Zhang; Shaogeng Liu; Jun Zhou; Kun Bian; You Zhou; Jianning Liu; Pei Zhang; Bingyan Liu

arXiv:2405.18679·cs.CV·September 26, 2025·2 cites

TL;DR

Vim-F introduces a novel visual model that combines frequency and spatial domain processing using FFT and pure Mamba encoders, enhancing global receptive fields and local correlation capture for improved image understanding.

Contribution

The paper proposes Vim-F, a new visual model that integrates frequency domain features with spatial features, removing position embedding and redesigning patch embedding for better performance.

Findings

1. Vim-F achieves superior performance on visual tasks compared to traditional ViMs.

2. Frequency domain integration enhances the global receptive field of the model.

3. Removing position embedding does not harm, and may improve, model effectiveness.
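
The second finding rests on a general property of the Fourier transform: every frequency coefficient is a sum over all spatial positions, so a change at one pixel reaches the whole spectrum. The NumPy sketch below is only an illustration of that property, not code from the paper; the 8x8 random array and the perturbed location are arbitrary choices made here.

```python
import numpy as np

# Illustrative input: a random single-channel "feature map" (arbitrary size).
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))

# Change the value at exactly one spatial location.
perturbed = image.copy()
perturbed[3, 5] += 1.0

# Compare the 2D spectra before and after the single-pixel change.
diff = np.fft.fft2(perturbed) - np.fft.fft2(image)

# Every frequency coefficient has moved, because each one depends on all pixels.
changed = np.abs(diff) > 1e-9
print(changed.all())
```

A convolution with a small kernel would only change outputs near the perturbed pixel, which is the contrast the "global receptive field" wording refers to.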

Abstract

In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences, for example in language understanding. Building efficient and general-purpose visual backbones based on SSMs is therefore a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), however, the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies and thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation…
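
The fusion step described in the abstract (take the FFT of a feature map and add the spectrum back to the feature map) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `add_spectrum`, the use of the amplitude (magnitude) of the spectrum, and the mixing weights `alpha` and `beta` are all hypothetical choices made here so that a complex spectrum can be added to a real-valued feature map.

```python
import numpy as np

def add_spectrum(feature_map, alpha=1.0, beta=1.0):
    """Add the amplitude of the 2D FFT of a feature map to the map itself.

    feature_map: real array of shape (C, H, W).
    alpha, beta: hypothetical mixing weights (assumptions, not from the source).
    """
    # 2D FFT over the spatial axes (H, W), computed independently per channel.
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    # Assumption: keep only the real-valued amplitude of the complex spectrum.
    amplitude = np.abs(spectrum)
    # The spectrum has the same (C, H, W) shape, so it can be added element-wise.
    return alpha * feature_map + beta * amplitude

# Toy input: a constant map, whose spectrum is a single peak at the zero frequency.
x = np.ones((2, 4, 4))
fused = add_spectrum(x)
print(fused.shape)
```

Because the FFT output has the same spatial shape as its input, the addition needs no resizing; how the two terms are actually weighted and normalized in Vim-F should be checked against the paper and the official repository.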

Code & Models

Repositories

yws-wxs/vim-f (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications