VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis
Shicheng Yin, Kaixuan Yin, Weixing Chen, Enbo Huang, Yang Liu

TL;DR
VisionGRU is a linear-complexity RNN architecture for efficient high-resolution image analysis. It is reported to outperform ViTs in accuracy while using less memory and computation.
Contribution
The paper proposes VisionGRU, a novel RNN-based model that combines a simplified Gated Recurrent Unit (minGRU) with hierarchical modules for efficient multi-scale image feature extraction.
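To make the minGRU idea concrete, here is a minimal sketch of the commonly described minGRU recurrence, in which the gate and the candidate state depend only on the current input and not on the previous hidden state. This summary does not specify the exact variant used inside VisionGRU, so the function names (`min_gru`, `linear`), the weight layout, and the sequential loop are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(weights, bias, x):
    # weights is a list of rows (out_dim x in_dim); returns weights @ x + bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def min_gru(xs, Wz, bz, Wh, bh, h0):
    """Run a minGRU-style recurrence over a sequence of input vectors.

    Assumed formulation (not taken from this summary):
        z_t  = sigmoid(Wz x_t + bz)        # update gate, input only
        c_t  = Wh x_t + bh                 # candidate state, input only
        h_t  = (1 - z_t) * h_{t-1} + z_t * c_t
    """
    h = list(h0)
    outputs = []
    for x in xs:  # one pass over the sequence: cost grows linearly with length
        z = [sigmoid(v) for v in linear(Wz, bz, x)]
        c = linear(Wh, bh, x)
        h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
        outputs.append(h)
    return outputs


# Tiny example with hypothetical 1-dimensional weights.
states = min_gru(xs=[[2.0], [4.0]], Wz=[[0.0]], bz=[0.0],
                 Wh=[[1.0]], bh=[0.0], h0=[0.0])
```

Because neither `z` nor `c` reads the previous hidden state, both can be computed for every position at once, and the remaining recurrence is a simple weighted running average. That is what allows a parallel scan in place of the sequential loop shown here.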
Findings
Outperforms ViTs on ImageNet and ADE20K datasets.
Reduces memory and computational costs significantly.
Effective for high-resolution image classification and segmentation.
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, both suffer from high computational costs, particularly when processing high-resolution images. Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have attracted attention due to their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification. VisionGRU leverages a simplified Gated Recurrent Unit (minGRU) to process large-scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, thus…
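The abstract's description of splitting an image into patches and then shortening the sequence while deepening the channels can be illustrated with simple arithmetic. The sketch below assumes a hypothetical configuration: a patch size of 4, four stages, and each stage halving the spatial resolution while doubling the channel count. None of these values come from this summary; they are typical of hierarchical backbones and are used only as an example.

```python
def stage_shapes(height, width, patch=4, base_channels=64, num_stages=4):
    """Return (sequence_length, channels) for each stage.

    All defaults are hypothetical and chosen for illustration only.
    """
    h, w, c = height // patch, width // patch, base_channels
    shapes = []
    for _ in range(num_stages):
        shapes.append((h * w, c))
        # Assumed downsampling rule: halve each spatial side, double channels.
        h, w, c = h // 2, w // 2, c * 2
    return shapes


for tokens, channels in stage_shapes(224, 224):
    print(tokens, channels)
```

Under these assumptions, each stage processes a quarter as many tokens as the one before it, so a model whose per-token cost is linear spends most of its sequence-processing work in the earliest, highest-resolution stage.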
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Vision and Imaging · CCD and CMOS Imaging Sensors
Methods: Softmax · Attention Is All You Need
