Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis
Badri N. Patro, Suhas Ranganath, Vinay P. Namboodiri, Vijay S. Agneeswaran

TL;DR
Heracles is a novel hybrid SSM-Transformer model that effectively captures local and global information for high-resolution image and time-series analysis, achieving state-of-the-art results across multiple datasets.
Contribution
The paper introduces Heracles, a hybrid SSM-Transformer that combines local and global SSMs with attention modules, addressing scalability and local information handling in high-resolution tasks.
Findings
Achieves 86.4% top-1 accuracy on ImageNet.
Outperforms existing models on multiple time-series datasets.
Excels in transfer learning and instance segmentation tasks.
Abstract
Transformers have revolutionized image modeling tasks with adaptations like DeiT, Swin, SVT, BiFormer, STViT, and FDViT. However, these models often face challenges with inductive bias and quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with handling local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a…
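The abstract names a Hartley kernel-based SSM for global information but, as excerpted here, does not say how the kernel is applied. As background only, the following is a minimal sketch of the standard discrete Hartley transform (DHT), which maps a real sequence to a real spectrum using the kernel cas(θ) = cos(θ) + sin(θ). The function names `cas`, `dht`, and `idht` and the sample signal are illustrative assumptions; this is not the Heracles layer itself.

```python
import math


def cas(theta: float) -> float:
    """Hartley kernel: cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht(x: list[float]) -> list[float]:
    """Direct O(N^2) discrete Hartley transform of a real sequence.

    H[k] = sum_t x[t] * cas(2*pi*t*k / N)
    """
    n = len(x)
    return [
        sum(x[t] * cas(2.0 * math.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def idht(h: list[float]) -> list[float]:
    """Inverse DHT: the transform is its own inverse up to a factor 1/N."""
    n = len(h)
    return [value / n for value in dht(h)]


# Hypothetical sample signal, chosen only for illustration.
signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 4.0]
spectrum = dht(signal)      # real-valued spectrum, same length as the input
recovered = idht(spectrum)  # round trip back to the original signal
```

Unlike the Fourier transform, the output stays real-valued, which is one reason a Hartley kernel may be attractive for mixing tokens globally. A practical implementation would normally use a fast O(N log N) algorithm rather than this direct sum.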
Taxonomy
Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Depthwise Convolution · Linear Layer · Positional Encoding Generator · Average Pooling · Pointwise Convolution · Depthwise Separable Convolution · Conditional Positional Encoding · Multi-Head Attention · Softmax
