Dynamic Spectrum Mixer for Visual Recognition
Zhiqiang Hu, Tao Yu

TL;DR
The paper introduces the Dynamic Spectrum Mixer (DSM), a novel frequency-domain approach for vision backbones that adaptively emphasizes informative spectral bands, improving performance across various visual recognition tasks.
Contribution
It proposes a content-adaptive, frequency-domain token interaction method using DCT and dynamic spectrum weights, enhancing adaptability and efficiency over existing MLP and transformer models.
Findings
Achieves 83.8% top-1 accuracy on ImageNet
Attains 49.9% mIoU on ADE20K
Outperforms previous models in classification, detection, and segmentation
Abstract
Recently, MLP-based vision backbones have achieved promising performance in several visual recognition tasks. However, existing MLP-based methods aggregate tokens directly with static weights, so they cannot adapt to different images. Moreover, recent research demonstrates that MLP-Transformer is good at building long-range dependencies but ineffective at capturing high frequencies, which primarily carry local information; this prevents it from being applied to downstream dense prediction tasks such as semantic segmentation. To address these challenges, we propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM). The DSM represents token interactions in the frequency domain by employing the Discrete Cosine Transform, which can learn long-term spatial dependencies with log-linear complexity. Furthermore, a dynamic…
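The frequency-domain token mixing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated before it specifies the dynamic weighting, so the gating scheme here (a softmax over a small bank of spectral filters, driven by the mean token) is an assumption, and the names `dynamic_spectrum_mix`, `filters`, and `proj` are hypothetical. For self-containment the DCT is applied as an explicit orthonormal matrix, which costs quadratic time per axis; a fast DCT routine would be needed to reach the log-linear complexity the abstract mentions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D of shape (n, n), so that D @ D.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # spatial index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)    # rescale the DC row for orthonormality
    return d

def dct2(x, dh, dw):
    """2D DCT over the spatial axes of x with shape (H, W, C)."""
    return np.einsum("uh,hwc,vw->uvc", dh, x, dw)

def idct2(s, dh, dw):
    """Inverse of dct2; the transpose suffices because dh and dw are orthonormal."""
    return np.einsum("uh,uvc,vw->hwc", dh, s, dw)

def dynamic_spectrum_mix(x, filters, proj):
    """Mix tokens in the DCT domain with content-dependent spectral weights.

    x:       (H, W, C) token grid.
    filters: (K, H, W, C) bank of spectral filters (hypothetical parameters).
    proj:    (C, K) projection producing gate logits (hypothetical parameters).
    """
    h, w, _ = x.shape
    dh, dw = dct_matrix(h), dct_matrix(w)

    # Assumed gating: softmax over K filters, conditioned on the mean token.
    context = x.mean(axis=(0, 1))                  # (C,)
    logits = context @ proj                        # (K,)
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()

    # Image-specific spectrum weights: convex combination of the filter bank.
    weight = np.tensordot(gate, filters, axes=1)   # (H, W, C)

    # Reweight each frequency coefficient, then return to the spatial domain.
    return idct2(dct2(x, dh, dw) * weight, dh, dw)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
filters = rng.standard_normal((4, 8, 8, 16))
proj = rng.standard_normal((16, 4))
y = dynamic_spectrum_mix(x, filters, proj)  # same shape as x
```

Because each frequency coefficient depends on every spatial position, a single elementwise multiplication in the DCT domain mixes information across the whole token grid, and the gate makes that mixing depend on the input image.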
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Computing and Algorithms · Advanced Image and Video Retrieval Techniques
Methods: Discrete Cosine Transform
