Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis
Hao Wu, Hui Li, Yiyun Su

TL;DR
This paper introduces Hilbert-VLM, a two-stage framework that re-engineers SAM2 by adding Hilbert space-filling-curve scanning to a Mamba state space model, together with a new attention mechanism, to improve 3D medical image segmentation and VLM-based diagnosis.
Contribution
The paper presents a new two-stage fusion framework with a redesigned SAM2 architecture incorporating Hilbert curves and a novel attention mechanism for better 3D medical image processing.
Findings
Achieves a Dice score of 82.35% on the BraTS2021 segmentation benchmark.
Attains 78.85% accuracy in disease classification.
Demonstrates improved spatial locality preservation in 3D data analysis.
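For readers unfamiliar with the segmentation metric quoted above, the Dice score measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch for flat binary masks; the function name `dice_score` and the smoothing term `eps` are illustrative assumptions, and the paper's exact evaluation protocol (for example, how tumor sub-regions are averaged on BraTS2021) is not stated in this summary.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for two flat binary masks (sequences of 0/1).

    Hypothetical helper for illustration; `eps` avoids division by zero
    when both masks are empty, in which case the score is treated as 1.0.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Number of voxels marked positive in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total positives across the two masks.
    total = sum(pred) + sum(target)
    return (2 * intersection + eps) / (total + eps)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    # One overlapping voxel, three positives in total: 2 * 1 / 3.
    print(round(dice_score(predicted, reference), 4))
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they share no positive voxels.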
Abstract
Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the…
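To make the idea of Hilbert-curve scanning concrete, the sketch below orders the cells of a 2D grid along a Hilbert curve before flattening them into a sequence, which is the kind of 1D ordering a sequence model such as Mamba consumes. This is a simplified 2D illustration using the standard index-to-coordinate conversion, not the authors' implementation: the paper targets 3D volumes, and the names `hilbert_d2xy`, `hilbert_order`, and `hilbert_flatten` are hypothetical.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. Standard iterative conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return all (x, y) cells of an n x n grid in Hilbert-curve order."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def hilbert_flatten(grid):
    """Flatten a square 2D list into a 1D sequence along the Hilbert curve."""
    n = len(grid)
    return [grid[y][x] for x, y in hilbert_order(n)]


def max_step(points):
    """Largest Manhattan distance between consecutive points in a scan."""
    return max(
        abs(x1 - x0) + abs(y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    n = 8
    raster = [(x, y) for y in range(n) for x in range(n)]
    print("Hilbert max step:", max_step(hilbert_order(n)))
    print("Raster max step:", max_step(raster))
```

Consecutive tokens in the Hilbert ordering are always grid neighbors, whereas a row-by-row raster scan jumps across the whole row at each line break. That is the spatial-locality property the abstract refers to.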
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
