U-VLM: Hierarchical Vision Language Modeling for Report Generation
Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang

TL;DR
U-VLM is a hierarchical vision-language model that combines multi-layer visual injection with progressive training to improve automated radiology report generation for 3D medical imaging while using a lightweight decoder.
Contribution
It presents a novel hierarchical architecture and training strategy that leverages segmentation-pretrained encoders and multi-scale visual features for report generation.
Findings
State-of-the-art results on CT-RATE (F1: 0.414 vs 0.258; BLEU-mean: 0.349 vs 0.305) and on AbdomenAtlas 3.0.
Progressive pretraining improves F1 scores.
Multi-layer visual injection enhances BLEU-mean scores.
Abstract
Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and…
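The routing idea in point (2) can be illustrated with a minimal sketch. The abstract does not state how encoder stages are matched to language model layers, so the even split below, the function name `route_stages_to_layers`, and the stage and layer counts are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def route_stages_to_layers(num_stages: int, num_layers: int) -> Dict[int, int]:
    """Map each decoder layer index to one encoder stage index.

    Hypothetical even split (an assumption, not taken from the paper):
    shallow encoder stages feed early decoder layers and deep stages
    feed late decoder layers, preserving order.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need 1 <= num_stages <= num_layers")
    # Integer arithmetic spreads the layers across the stages in order.
    return {layer: layer * num_stages // num_layers for layer in range(num_layers)}


# Training order named in the abstract; each stage may use a different dataset.
TRAINING_STAGES: List[str] = ["segmentation", "classification", "report_generation"]

# Illustrative sizes only: 4 encoder stages, 12 decoder layers.
routing = route_stages_to_layers(num_stages=4, num_layers=12)

for layer, stage in routing.items():
    print(f"decoder layer {layer:2d} <- encoder stage {stage}")
```

With these illustrative sizes, each encoder stage is assigned to three consecutive decoder layers. A real model would also need a projection from each stage's feature shape to the decoder's hidden size, which this sketch leaves out.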
Taxonomy
Topics: Multimodal Machine Learning Applications · Radiology practices and education · Artificial Intelligence in Healthcare and Education
