Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin; Maximilian Beck; Korbinian Pöppel; Sepp Hochreiter; Johannes Brandstetter

arXiv:2406.04303·cs.CV·February 24, 2025·20 cites

TL;DR

This paper introduces Vision-LSTM (ViL), a novel vision backbone based on xLSTM blocks, which process image patches in a bidirectional manner, offering a promising alternative to transformers for computer vision tasks.

Contribution

The paper adapts the scalable xLSTM architecture to vision, creating ViL, a new backbone that processes patch sequences bidirectionally, potentially enhancing vision model performance.

Findings

01

ViL demonstrates competitive performance as a vision backbone.

02

Bidirectional processing of patches improves feature extraction.

03

ViL shows promise for future deployment in vision architectures.

Abstract

Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
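The alternating scan order described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `patchify`, `causal_block`, and `alternating_stack` are hypothetical names, the row-major patch ordering is an assumption, and `causal_block` is a simple running average standing in for an xLSTM block. It only shows why flipping the sequence in every second block lets each patch token receive information from every other patch after two blocks.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a row-major sequence of flattened patches.
    Row-major ordering (top-left patch first) is an assumption of this sketch."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def causal_block(tokens, decay=0.5):
    """Hypothetical stand-in for an xLSTM block: a causal running average.
    Output t depends only on tokens 0..t, as in any one-directional recurrence."""
    tokens = np.asarray(tokens, dtype=float)
    out = np.empty_like(tokens)
    state = np.zeros(tokens.shape[1])
    for t, x in enumerate(tokens):
        state = decay * state + (1.0 - decay) * x
        out[t] = state
    return out

def alternating_stack(tokens, depth):
    """Apply `depth` blocks: the 1st, 3rd, ... scan first-to-last,
    the 2nd, 4th, ... scan last-to-first (flip, scan, flip back)."""
    x = np.asarray(tokens, dtype=float)
    for i in range(depth):
        if i % 2 == 0:
            x = causal_block(x)
        else:
            x = causal_block(x[::-1])[::-1]
    return x

if __name__ == "__main__":
    image = np.arange(16, dtype=float).reshape(4, 4, 1)
    tokens = patchify(image, patch=2)          # 4 tokens, 4 values each
    features = alternating_stack(tokens, depth=2)
    print(tokens.shape, features.shape)
```

With a single block, the first token never sees later patches; adding the flipped second block closes that gap, which is the motivation for alternating directions instead of using a purely causal stack.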

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. It's interesting to try transferring different language models into vision. As the Transformer has been adapted very successfully to computer vision and Mamba has recently been introduced into various vision tasks with comparable performance, validating a similar effect for the LSTM can provide many insights to the community.
2. The performance is good. xLSTM shows competitive results in classification and semantic segmentation.
3. The detailed ablations of architectural design are interestin

Weaknesses

1. On the ImageNet-1k classification task, the model seems not to scale well. The ViL-Base underperforms DeiT-III by a large margin. Is this caused by a technical reason (e.g., insufficient hyper-parameter search) or the limitation of LSTM's learning capacity? Can ViL scale to a larger size?
2. In the main tables of the paper, the authors emphasize comparing the models' FLOPs as a measure of speed, which may not be a fair comparison between recurrent models and transformers. Typically, at the s

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The proposed ViL shows that xLSTM also performs well in visual feature encoding and can be considered a strong candidate for a universal visual backbone.
2. Extensive experiments are conducted to verify the strong performance of ViL on three vision tasks.

Weaknesses

1. The technical contribution is limited: the proposed ViL is a simple adaptation of xLSTM blocks to vision tasks. Although it contains some necessary modifications for processing non-causal image data (bidirectional flip, conv2d, etc.), it is still straightforward.
2. Lack of experiments to prove the main advantages of ViL: compared with transformers, the ViL has linear complexity. But the experiments do not provide enough evidence to show this advantage. For example, the mentioned lack of an

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The attempt at a new linear vision backbone is great.
2. This work has a detailed experimental setup in classification, transfer learning and segmentation. ViL performs well on ImageNet accuracy, ADE20K mIoU and VTAB-1K accuracy.

Weaknesses

1. Because of the good training recipe (data augmentation, optimization method, etc.), it is not difficult to get good performance when training a new vision backbone. My main concern is how to validate the scaling law of a new backbone, namely the proposed ViL.
2. The largest model size of ViL is 89M and 115M (ViL-B), so how can it be validated that the performance holds up at larger model sizes?

Code & Models

Videos

Vision-LSTM: xLSTM as Generic Vision Backbone· slideslive

Taxonomy

Topics: Infrared Target Detection Methodologies · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory