U-VLM: Hierarchical Vision Language Modeling for Report Generation

Pengcheng Shi; Minghui Zhang; Kehan Song; Jiaqi Liu; Yun Gu; Xinglin Zhang

arXiv:2603.00479·cs.CV·March 3, 2026

Open Access

TL;DR

U-VLM is a hierarchical vision-language model that combines multi-layer visual injection with progressive training to improve automated radiology report generation for 3D medical imaging while using a lightweight decoder.

Contribution

The paper presents a hierarchical architecture and a progressive training strategy that use segmentation-pretrained encoders and multi-scale visual features for report generation.

Findings

01. State-of-the-art results on the CT-RATE and AbdomenAtlas 3.0 datasets.

02. Progressive pretraining improves F1 scores.

03. Multi-layer visual injection improves BLEU-mean scores.
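
To put the F1 and BLEU-mean findings in proportion, the CT-RATE scores quoted in the abstract (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) can be converted into relative gains over the compared baseline. The helper name `relative_gain` is an illustrative choice, not something from the paper; only the four scores come from the source.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# (U-VLM score, compared baseline score) on CT-RATE, as quoted in the abstract.
CT_RATE = {
    "F1": (0.414, 0.258),
    "BLEU-mean": (0.349, 0.305),
}

gains = {metric: round(relative_gain(new, base), 1)
         for metric, (new, base) in CT_RATE.items()}
print(gains)  # {'F1': 60.5, 'BLEU-mean': 14.4}
```

On these figures the F1 gain (about 60%) is much larger than the BLEU-mean gain (about 14%).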

Abstract

Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and…
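
The two ideas in the abstract, a fixed stage order for training and a routing of encoder stages to language model layers, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how many encoder stages or decoder layers U-VLM has, nor how "corresponding" layers are chosen, so the even-spacing rule, the function `route_encoder_stages`, the `Stage` class, and the dataset labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    objective: str  # training objective named in the abstract
    dataset: str    # placeholder label; each stage may use a different dataset


# Stage order follows the abstract; dataset labels are placeholders.
PROGRESSIVE_STAGES: List[Stage] = [
    Stage("segmentation", "seg_dataset_placeholder"),
    Stage("classification", "cls_dataset_placeholder"),
    Stage("report_generation", "report_dataset_placeholder"),
]


def route_encoder_stages(num_encoder_stages: int, num_lm_layers: int) -> Dict[int, int]:
    """Map each encoder stage index to one language model layer index.

    Assumption (not from the paper): stages are spread evenly over the
    decoder depth, shallowest stage first, deepest stage at the last layer.
    """
    if num_encoder_stages < 1:
        raise ValueError("need at least one encoder stage")
    if num_lm_layers < num_encoder_stages:
        raise ValueError("need at least one language model layer per encoder stage")
    if num_encoder_stages == 1:
        return {0: 0}
    last_layer = num_lm_layers - 1
    last_stage = num_encoder_stages - 1
    # Integer arithmetic keeps the mapping deterministic and collision-free.
    return {s: (s * last_layer) // last_stage for s in range(num_encoder_stages)}


# Hypothetical sizes: 5 encoder stages, 12 decoder layers.
routing = route_encoder_stages(5, 12)
for stage in PROGRESSIVE_STAGES:
    print(f"train {stage.objective} on {stage.dataset}")
print(routing)
```

With the hypothetical sizes above, each of the five encoder stages is assigned its own decoder layer, in contrast to the input-only injection the abstract criticizes, where every visual feature would enter at layer 0.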

Taxonomy

Topics: Multimodal Machine Learning Applications · Radiology practices and education · Artificial Intelligence in Healthcare and Education