Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu; Hui Li; Yiyun Su

arXiv:2512.24013 · cs.CV · January 1, 2026

TL;DR

This paper introduces Hilbert-VLM, a framework for improving the accuracy of VLM-based medical diagnosis on 3D images. It re-engineers SAM2 by incorporating Hilbert space-filling curves into the scanning mechanism of a Mamba state space model and by adding a new attention mechanism.

Contribution

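The paper presents a two-stage fusion framework. In the first stage, a redesigned SAM2 architecture (the HilbertMed-SAM module), built with Hilbert curves and a new attention mechanism, segments lesions in 3D multimodal medical images; in the second stage, the resulting multimodal enhanced prompts guide a VLM toward disease classification.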

Findings

01. Achieves 82.35% Dice score on BraTS2021 segmentation benchmark.
02. Attains 78.85% accuracy in disease classification.
03. Demonstrates improved spatial locality preservation in 3D data analysis.
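Finding 01 is reported as a Dice score. For reference, the standard Dice similarity coefficient for a predicted mask P and a ground-truth mask T is 2|P ∩ T| / (|P| + |T|), ranging from 0 (no overlap) to 1 (identical masks). The sketch below computes it on flattened 0/1 masks. The helper names `dice_score` and `mean_dice`, the convention that two empty masks score 1.0, and the unweighted averaging over regions are assumptions for illustration: BraTS challenges commonly report Dice per tumor sub-region (enhancing tumor, tumor core, whole tumor), but this summary does not say which regions or averaging scheme produce the reported 82.35%.

```python
def dice_score(pred, target):
    """Dice = 2|P & T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Foreground voxel counts of each mask, summed.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Hypothetical helper: unweighted mean Dice over (pred, target) pairs,
    e.g. one pair per tumor sub-region."""
    scores = [dice_score(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 2*1 / (2+2) = 0.5
```

In practice the masks would be 3D volumes flattened to 1D; libraries such as MONAI provide batched, GPU-friendly Dice metrics, whereas this pure-Python version only illustrates the formula.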

Abstract

Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges: specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the…
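The abstract describes replacing the scan order that a Mamba SSM uses to serialize 3D data with a Hilbert space-filling curve, and Finding 03 ties this to spatial locality. A minimal sketch of such an ordering follows, assuming a cubic grid of patches whose side is a power of two and using Skilling's (2004) transpose-based Hilbert encoding. `hilbert_index`, `scan_order` and `step_distances` are hypothetical helpers, not the paper's code; the actual HilbertMed-SAM scan (grid size, scan directions, how it plugs into SAM2) is not described in this excerpt.

```python
from itertools import product


def hilbert_index(coords, bits):
    """Return the Hilbert-curve index of integer coords in [0, 2**bits)^n.

    Sketch of Skilling's (2004) transpose-based encoding; bits must be >= 1.
    """
    x = list(coords)
    n = len(x)
    m = 1 << (bits - 1)
    # Inverse undo: process bit planes from the most significant one down.
    q = m
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p  # invert the lower bits of axis 0
            else:
                t = (x[0] ^ x[i]) & p  # swap the lower bits of axes 0 and i
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray-encode across axes.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    for i in range(n):
        x[i] ^= t
    # Interleave the transposed bits (MSB first) into one integer index.
    index = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            index = (index << 1) | ((x[i] >> b) & 1)
    return index


def scan_order(side, order="hilbert"):
    """List all (x, y, z) cells of a side^3 grid in Hilbert or raster order."""
    if side < 2 or side & (side - 1):
        raise ValueError("side must be a power of two >= 2")
    bits = side.bit_length() - 1
    cells = list(product(range(side), repeat=3))  # raster: last axis fastest
    if order == "hilbert":
        cells.sort(key=lambda c: hilbert_index(c, bits))
    return cells


def step_distances(cells):
    """Manhattan distance between each pair of consecutive cells in a scan."""
    return [sum(abs(a - b) for a, b in zip(c0, c1))
            for c0, c1 in zip(cells, cells[1:])]


hilbert = step_distances(scan_order(8, "hilbert"))
raster = step_distances(scan_order(8, "raster"))
print("hilbert: max", max(hilbert), "mean", sum(hilbert) / len(hilbert))
print("raster:  max", max(raster), "mean", round(sum(raster) / len(raster), 3))
```

On an 8×8×8 grid, every step along the Hilbert order moves to a face-adjacent cell (distance 1), while raster order jumps up to 15 cells at slab boundaries. This only measures consecutive tokens; a fuller locality comparison would also check how far apart in the sequence spatial neighbours end up.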

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning