Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Ibrahim Ethem Hamamci; Sezgin Er; Suprosanna Shit; Hadrien Reynaud; Dong Yang; Pengfei Guo; Marc Edgar; Daguang Xu; Bernhard Kainz; Bjoern Menze

arXiv:2510.20639·cs.CV·October 24, 2025

Open Access

TL;DR

BTB3D is a 3D tokenization method built on a causal convolutional encoder-decoder. It improves report generation and text-to-CT synthesis in 3D medical imaging while preserving fine anatomical detail and scaling to long, high-resolution scans.

Contribution

The paper presents BTB3D, a new 3D tokenization approach with a causal convolutional encoder-decoder, enabling scalable, high-resolution vision-language modeling in 3D medical imaging.
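
To illustrate why causality along the slice axis can let one network handle both a single 2D slice and a full 3D volume, here is a minimal sketch. It is not the paper's implementation: the function `causal_conv_depth`, the scalar per-slice features, and the kernel weights are all hypothetical, chosen only to show the property that each output depends on the current and earlier slices, never on later ones.

```python
def causal_conv_depth(slices, kernel):
    """Causal 1D convolution along the slice (depth) axis.

    slices: list of floats, one toy feature value per axial slice.
    kernel: list of weights; kernel[-1] multiplies the current slice,
            kernel[0] multiplies the oldest slice in the window.
    """
    k = len(kernel)
    # Pad on the left only, so output[t] never reads a slice after t.
    padded = [0.0] * (k - 1) + list(slices)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(slices))
    ]


# Hypothetical toy inputs (not taken from the paper).
kernel = [0.25, 0.25, 0.5]
volume = [1.0, 2.0, 3.0, 4.0]

as_volume = causal_conv_depth(volume, kernel)
as_single_slice = causal_conv_depth(volume[:1], kernel)

# The first output is identical whether the slice is processed alone
# (2D case) or as the start of a longer volume (3D case).
print(as_volume[0] == as_single_slice[0])
```

Because padding is applied only on the past side, a depth-1 input and the first slice of a deeper input produce the same output, which is one plausible reading of how a causal design "unifies" 2D and 3D processing.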

Findings

01

Achieves state-of-the-art BLEU scores and higher clinical F1 in report generation.

02

Reduces FID by 75% and halves FVD in text-to-CT synthesis.

03

Supports scans exceeding 300 slices without extra memory overhead.
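
The abstract lists overlapping-window tiling as one stage of the training curriculum. As a rough illustration of how tiling along the slice axis can keep per-step memory bounded for long scans, the sketch below computes overlapping slice ranges. The function name `overlapping_windows` and the window and overlap sizes are assumptions made for this example; the paper's actual tiling configuration is not stated on this page.

```python
def overlapping_windows(num_slices, window, overlap):
    """Return (start, end) slice ranges that cover [0, num_slices).

    Each range holds at most `window` slices, and consecutive ranges
    share `overlap` slices (the last range may start at the same stride
    but is clipped at the end of the scan).
    """
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    stride = window - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + window, num_slices)
        ranges.append((start, end))
        if end == num_slices:
            break
        start += stride
    return ranges


# Hypothetical sizes: a 320-slice scan, 128-slice windows, 32-slice overlap.
for start, end in overlapping_windows(320, 128, 32):
    print(start, end)
```

Under these assumed sizes, no window holds more than 128 slices, so the peak number of slices processed at once does not grow with scan length; only the number of windows does.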

Abstract

Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications