TL;DR
ConvVitMamba is a hybrid deep learning framework that combines multiscale convolution, Vision Transformers, and Mamba-inspired modules to efficiently classify hyperspectral images with high accuracy and reduced computational cost.
Contribution
Introduces a unified architecture that integrates a multiscale convolutional feature extractor, a Vision Transformer encoder, and a Mamba-inspired gated sequence mixing module for efficient hyperspectral image classification.
Findings
Outperforms CNN, Transformer, and Mamba-based methods on benchmark datasets.
Achieves a good balance between accuracy, model size, and inference speed.
Ablation studies validate the effectiveness of each component.
Abstract
Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware refinement without quadratic…
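To make the third component more concrete, here is a minimal NumPy sketch of a gated sequence mixing step applied to a token sequence. It is an illustration only, not the authors' implementation: the function name `gated_sequence_mix`, the weight matrices `w_gate`, `w_in`, and `w_out`, and the specific recurrence are all assumptions chosen for clarity. The sketch shows how an input-dependent gate can mix information along the token axis with a single pass over the sequence.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_sequence_mix(tokens, w_gate, w_in, w_out):
    """Hypothetical gated mixing over a token sequence.

    tokens: array of shape (seq_len, dim), e.g. encoder output tokens.
    w_gate, w_in, w_out: arrays of shape (dim, dim); illustrative weights.
    Returns an array of shape (seq_len, dim).
    """
    seq_len, dim = tokens.shape
    # Content-aware gate in (0, 1): each token decides how much state to keep.
    forget = sigmoid(tokens @ w_gate)
    candidate = tokens @ w_in

    state = np.zeros(dim)
    mixed = np.empty_like(candidate)
    # One pass over the sequence: cost grows linearly with seq_len.
    for t in range(seq_len):
        state = forget[t] * state + (1.0 - forget[t]) * candidate[t]
        mixed[t] = state

    # Output gate plus residual connection, so the module refines its input.
    out_gate = sigmoid(tokens @ w_out)
    return tokens + out_gate * mixed


# Usage with random data (shapes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
weights = [rng.normal(size=(4, 4)) for _ in range(3)]
refined = gated_sequence_mix(tokens, *weights)
```

In this sketch each output token depends only on the current and earlier tokens, and no pairwise token-to-token score matrix is ever formed. How closely this matches the module used in ConvVitMamba is not stated in the abstract.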
