Towards Evaluating the Robustness of Visual State Space Models
Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

TL;DR
This paper evaluates the robustness of Vision State Space Models (VSSMs) against various natural and adversarial perturbations, comparing their performance with other architectures and analyzing their resilience in complex visual scenarios.
Contribution
It provides a comprehensive robustness evaluation of VSSMs across multiple perturbation types and benchmarks, highlighting their strengths and limitations in complex visual tasks.
Findings
VSSMs show robustness to certain corruptions but are vulnerable to others.
Frequency analysis reveals differential performance against low and high-frequency attacks.
VSSMs outperform some architectures in specific robustness scenarios.
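The frequency analysis mentioned above rests on separating a perturbation into low-frequency and high-frequency parts. The sketch below is one common way to do that split with a circular mask in the Fourier domain. It is an illustration only, not the authors' code: the function name `split_perturbation`, the `radius` cutoff, and the random perturbation are all assumptions introduced here.

```python
import numpy as np

def split_perturbation(delta, radius):
    """Split a 2-D perturbation into low- and high-frequency parts.

    `radius` is a hypothetical cutoff, in frequency bins, measured from
    the centre of the shifted spectrum (where the zero frequency sits).
    """
    h, w = delta.shape
    # Move the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(delta))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius  # True inside the low-frequency disc
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Stand-in perturbation: uniform noise with an assumed budget of 8/255.
rng = np.random.default_rng(0)
delta = rng.uniform(-8 / 255, 8 / 255, size=(32, 32))
low, high = split_perturbation(delta, radius=4)
```

Because the two masks are complementary, `low + high` reconstructs the original perturbation, so either component can be added to an image on its own to probe a model's sensitivity to that frequency band.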
Abstract
Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We…
Taxonomy
Topics: Advanced Vision and Imaging · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection
