${\mu}^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation
Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang

TL;DR
This paper introduces ${\mu}^2$Tokenizer, a differentiable multi-scale multi-modal tokenizer for radiology report generation, improving report quality and evaluation through a novel intermediate layer and a scalable, LLM-driven pipeline.
Contribution
The paper presents a novel ${\mu}^2$Tokenizer and a fine-tuned ${\mu}^2$LLM framework that enhance radiology report generation from CT scans, addressing data limitations and evaluation challenges.
Findings
Outperforms existing methods on four CT datasets
Enhances report quality via direct preference optimization
Provides a scalable pipeline for high-quality supervision
Abstract
Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose ${\mu}^2$LLM, a multiscale multimodal large language model for RRG tasks. The novel ${\mu}^2$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
