${\mu}^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation

Siyou Li; Pengyao Qin; Huanan Wu; Dong Nie; Arun J. Thirunavukarasu; Juntao Yu; Le Zhang

arXiv:2507.00316 · cs.LG · July 3, 2025

TL;DR

This paper introduces the $\mu^2$Tokenizer, a differentiable multi-scale multi-modal tokenizer for radiology report generation. It serves as an intermediate layer that integrates features from a multiscale visual tokenizer and a text tokenizer, and it is paired with a scalable, LLM-driven pipeline for improving and evaluating report quality.

Contribution

The paper presents a novel $\mu^2$Tokenizer and a fine-tuned $\mu^2$LLM framework that enhance radiology report generation from CT scans, addressing data limitations and evaluation challenges.

Findings

01. Outperforms existing methods on four CT datasets

02. Enhances report quality via direct preference optimization

03. Provides a scalable pipeline for high-quality supervision

Abstract

Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and the provision of management advice. RRG is complicated by two key challenges: (1) the inherent complexity of extracting relevant information from imaging data under resource constraints, and (2) the difficulty of objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $\mu^{2}$LLM, a $\underline{mu}$ltiscale $\underline{mu}$ltimodal large language model for RRG tasks. The novel $\mu^{2}$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by…
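The abstract names direct preference optimization (DPO) without stating the objective. As a rough illustration only, the sketch below computes the standard per-pair DPO loss from summed token log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") report under the trained policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example; they are not taken from the paper, and the authors' actual training setup may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full report under
    either the policy being trained or the frozen reference model.
    `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more (in log space) the policy likes each report than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen report and lowers the rejected one: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred report than the reference model does, which is the mechanism by which preference pairs steer generation quality.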

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning