Sequencer: Deep LSTM for Image Classification
Yuki Tatsunami, Masato Taki

TL;DR
Sequencer is an LSTM-based architecture for image classification that models long-range dependencies without self-attention. It rivals Vision Transformers, with Sequencer2D-L reaching 84.6% top-1 accuracy on ImageNet-1K.
Contribution
This paper presents Sequencer, a new LSTM-based architecture for vision tasks that reaches accuracy competitive with self-attention models such as ViT.
Findings
Sequencer2D-L achieves 84.6% top-1 accuracy on ImageNet-1K.
The model demonstrates good transferability to other datasets.
It maintains robust performance across different input resolutions.
Abstract
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also…
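To make the idea of mixing patch tokens with LSTMs rather than self-attention concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the row-and-column arrangement, the helper names (`make_lstm`, `lstm_run`, `bilstm`, `bilstm2d`, `make_params`), the random weights, and the tiny dimensions are all hypothetical and chosen only for readability. Each row and each column of a patch grid is treated as a sequence and passed through a bidirectional LSTM, and the two results are concatenated per token.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_dim, hidden_dim, rng):
    """Random weights for one LSTM direction (hypothetical helper).

    Four gates (input, forget, output, candidate); each gate is a
    hidden_dim x (input_dim + hidden_dim + 1) matrix, last column = bias.
    """
    cols = input_dim + hidden_dim + 1
    return [[[rng.uniform(-0.5, 0.5) for _ in range(cols)]
             for _ in range(hidden_dim)] for _ in range(4)]


def lstm_run(weights, seq):
    """Run one LSTM direction over a list of vectors; return every hidden state."""
    hidden_dim = len(weights[0])
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in seq:
        z = list(x) + h + [1.0]  # input, previous hidden state, bias term
        pre = [[sum(w * v for w, v in zip(row, z)) for row in gate]
               for gate in weights]
        i = [sigmoid(v) for v in pre[0]]    # input gate
        f = [sigmoid(v) for v in pre[1]]    # forget gate
        o = [sigmoid(v) for v in pre[2]]    # output gate
        g = [math.tanh(v) for v in pre[3]]  # candidate cell values
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs


def bilstm(fwd, bwd, seq):
    """Bidirectional pass: concatenate forward and backward states per position."""
    out_f = lstm_run(fwd, seq)
    out_b = lstm_run(bwd, seq[::-1])[::-1]
    return [a + b for a, b in zip(out_f, out_b)]


def bilstm2d(grid, params):
    """Mix a grid of patch tokens along rows and columns (assumed layout)."""
    n_rows, n_cols = len(grid), len(grid[0])
    # Horizontal pass: every row is one sequence.
    horiz = [bilstm(params["h_fwd"], params["h_bwd"], row) for row in grid]
    # Vertical pass: every column is one sequence.
    columns = [[grid[r][c] for r in range(n_rows)] for c in range(n_cols)]
    vert = [bilstm(params["v_fwd"], params["v_bwd"], col) for col in columns]
    # Concatenate vertical and horizontal features for each token.
    return [[vert[c][r] + horiz[r][c] for c in range(n_cols)]
            for r in range(n_rows)]


def make_params(input_dim, hidden_dim, seed=0):
    rng = random.Random(seed)
    return {name: make_lstm(input_dim, hidden_dim, rng)
            for name in ("h_fwd", "h_bwd", "v_fwd", "v_bwd")}


# Toy example: a 3 x 4 grid of patch tokens with 2 features each.
rng = random.Random(1)
grid = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
        for _ in range(3)]
params = make_params(input_dim=2, hidden_dim=3)
mixed = bilstm2d(grid, params)
print(len(mixed), len(mixed[0]), len(mixed[0][0]))
```

In this sketch each output token carries 4 × `hidden_dim` features (two directions for each of the two axes). A single layer lets a token see only its own row and column, so reaching every position in the grid would take stacked layers. A practical model would also use trained weights and a tensor library rather than nested lists.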
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Average Pooling · Dropout · Global Average Pooling · Sigmoid Activation · Tanh Activation · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer
