Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis

Badri N. Patro; Suhas Ranganath; Vinay P. Namboodiri; Vijay S. Agneeswaran

arXiv:2403.18063 · cs.CV · June 5, 2024 · 1 citation

TL;DR

Heracles is a novel hybrid SSM-Transformer model that effectively captures local and global information for high-resolution image and time-series analysis, achieving state-of-the-art results across multiple datasets.

Contribution

The paper introduces Heracles, a hybrid SSM-Transformer that combines local and global SSMs with attention modules, addressing scalability and local information handling in high-resolution tasks.

Findings

01. Achieves 86.4% top-1 accuracy on ImageNet.
02. Outperforms existing models on multiple time-series datasets.
03. Excels in transfer learning and instance segmentation tasks.

Abstract

Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and high quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with handling local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a…
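The abstract names a Hartley kernel as the basis of the global SSM but is cut off before giving details. As background only, the following is a minimal sketch of the discrete Hartley transform (DHT), a real-valued relative of the Fourier transform that is its own inverse up to a factor of 1/N. The function names and the `gate` parameter are hypothetical, chosen for illustration; this is not the authors' implementation, and the element-wise gating step is an assumption about how a spectral kernel might be applied.

```python
import math


def dht(x):
    """Naive O(N^2) discrete Hartley transform of a real-valued sequence."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * k * t / n
            # cas(angle) = cos(angle) + sin(angle) is the Hartley kernel.
            acc += x[t] * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out


def idht(h):
    """Inverse DHT: the same transform, scaled by 1/N."""
    n = len(h)
    return [v / n for v in dht(h)]


def hartley_filter(x, gate):
    """Hypothetical global mixing: scale each Hartley coefficient by a gate.

    ASSUMPTION: `gate` stands in for a learned per-frequency weight; the
    source does not specify how Heracles applies its kernel.
    """
    spectrum = dht(x)
    mixed = [s * g for s, g in zip(spectrum, gate)]
    return idht(mixed)


signal = [1.0, 2.0, 0.0, -1.0]
restored = idht(dht(signal))                       # round trip recovers the input
unchanged = hartley_filter(signal, [1.0] * len(signal))  # all-ones gate is a no-op
```

Because every output coefficient depends on every input sample, a single transform mixes information across the whole sequence, which is one plausible reason such kernels are used for global context. A practical version would compute the DHT from an FFT rather than with the quadratic loop shown here.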

Taxonomy

Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors

Methods: Attention Is All You Need · Depthwise Convolution · Linear Layer · Positional Encoding Generator · Average Pooling · Pointwise Convolution · Depthwise Separable Convolution · Conditional Positional Encoding · Multi-Head Attention · Softmax