HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

Yanzhao Shi; Xiaodan Zhang; Junzhong Ji; Haoning Jiang; Chengxin Zheng; Yinong Wang; Liangqiong Qu

arXiv:2506.09634·cs.CV·June 12, 2025

Open Access

TL;DR

HSENet is a framework for 3D medical vision-language understanding that pairs dual-3D vision encoders with a spatial packer, improving performance on diagnosis-related tasks such as retrieval, report generation, and visual question answering.

Contribution

The paper introduces HSENet, which combines dual-3D vision encoders with a spatial packer for efficient, accurate 3D medical vision-language understanding, addressing the limitations of prior models that focus on 2D images.

Findings

01. Achieves state-of-the-art in 3D language-visual retrieval with 39.85% R@100

02. Improves 3D medical report generation BLEU-4 score to 24.01%

03. Enhances 3D visual question answering accuracy to 73.60%
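
The retrieval finding is reported as R@100 (Recall@100). This page does not define the metric, so the sketch below uses the common definition: the fraction of queries whose correct item appears among the top K ranked candidates. The function name `recall_at_k`, the toy similarity matrix, and the convention that query i is paired with candidate i are all assumptions for illustration, not details taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k.

    similarity[i][j] is an assumed score between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 matrix (hypothetical values): row = text query, column = candidate volume.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct volume ranked first
    [0.8, 0.5, 0.6],  # query 1: correct volume ranked third
    [0.2, 0.7, 0.4],  # query 2: correct volume ranked second
]

r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
print(f"R@1 = {r1:.2%}, R@2 = {r2:.2%}")
```

With K = 100 and a full test set in place of the toy matrix, the same computation would yield a figure comparable in form to the 39.85% above, provided the paper follows this standard definition.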

Abstract

Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling