HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding
Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng, Yinong Wang, Liangqiong Qu

TL;DR
HSENet is a framework for 3D medical vision-language understanding that pairs dual 3D vision encoders with a spatial packer (a multimodal projector), improving performance on retrieval, report generation, and visual question answering.
Contribution
The paper introduces HSENet, which combines dual 3D vision encoders with a spatial packer for efficient and accurate 3D medical vision-language understanding, addressing the limitations of prior models that focus mainly on 2D images.
Findings
Achieves state-of-the-art 3D language-visual retrieval performance with 39.85% R@100
Improves 3D medical report generation BLEU-4 score to 24.01%
Enhances 3D visual question answering accuracy to 73.60%
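
For readers unfamiliar with the retrieval metric, the following is a minimal sketch of how Recall@K (written R@100 above when K = 100) is commonly computed. It is not the paper's evaluation code. It assumes a one-to-one pairing in which query i matches candidate i, and the function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k.

    Assumption: similarity[i][j] scores query i against candidate j,
    and the correct candidate for query i is candidate i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices from most to least similar for this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example (made-up numbers): rows are report queries, columns are CT volumes.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
```

A larger K makes the metric more lenient, so R@100 values are only comparable across methods evaluated on a candidate pool of the same size.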
Abstract
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
