VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Maximilian Rokuss; Moritz Langenberg; Yannick Kirchhoff; Fabian Isensee; Benjamin Hamm; Constantin Ulrich; Sebastian Regnery; Lukas Bauer; Efthimios Katsigiannopulos; Tobias Norajitra; Klaus Maier-Hein

arXiv:2511.11450·cs.CV·November 17, 2025

TL;DR

VoxTell is a vision-language model for free-text-prompted 3D medical image segmentation. It reports state-of-the-art zero-shot performance across CT, MRI, and PET on unseen datasets.

Contribution

VoxTell introduces multi-stage vision-language fusion across decoder layers, aligning textual and visual features at multiple scales. It is trained on more than 62K CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes.

Findings

01. Achieves state-of-the-art zero-shot segmentation performance

02. Demonstrates strong cross-modality transfer and robustness

03. Provides accurate instance-specific segmentation from free text

Abstract

We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
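
To make the idea of fusing text and visual features at multiple scales concrete, here is a minimal NumPy sketch. It is not VoxTell's architecture: the function names (`text_response`, `upsample_nearest`, `multi_scale_fusion`), the random projection matrices, the dot-product fusion, and the summing of per-scale responses are all assumptions chosen for illustration. It only shows the general pattern of letting one text embedding query decoder features at several resolutions and combining the results into a 3D mask.

```python
import numpy as np

def text_response(features, text_emb, proj):
    """Per-voxel dot product between a projected text embedding and visual features.

    features: (C, D, H, W) decoder feature map at one scale
    text_emb: (E,) prompt embedding
    proj:     (C, E) hypothetical projection into that scale's channel space
    """
    query = proj @ text_emb                              # (C,)
    return np.einsum("c,cdhw->dhw", query, features)    # (D, H, W)

def upsample_nearest(volume, factor):
    """Nearest-neighbour upsampling of a 3D volume by an integer factor."""
    for axis in range(3):
        volume = np.repeat(volume, factor, axis=axis)
    return volume

def multi_scale_fusion(feature_pyramid, text_emb, projections):
    """Sum text-conditioned responses from every scale at full resolution.

    Assumes feature_pyramid[0] is the finest scale and that each coarser
    scale is smaller by the same integer factor along all three axes.
    """
    full_shape = feature_pyramid[0].shape[1:]
    logits = np.zeros(full_shape)
    for features, proj in zip(feature_pyramid, projections):
        factor = full_shape[0] // features.shape[1]
        logits += upsample_nearest(text_response(features, text_emb, proj), factor)
    return logits

# Toy inputs: random stand-ins for a prompt embedding and decoder features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=16)
feature_pyramid = [
    rng.normal(size=(8, 16, 16, 16)),   # fine scale
    rng.normal(size=(16, 8, 8, 8)),     # middle scale
    rng.normal(size=(32, 4, 4, 4)),     # coarse scale
]
projections = [rng.normal(size=(f.shape[0], text_emb.shape[0])) for f in feature_pyramid]

logits = multi_scale_fusion(feature_pyramid, text_emb, projections)
mask = logits > 0   # binary 3D mask with the finest scale's spatial shape
```

In a trained model the projections and features would be learned and the prompt embedding would come from a text encoder; here they are random, so the resulting mask carries no meaning beyond having the expected shape.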

Code & Models

Models

Hugging Face: mrokuss/VoxTell (model, 351 downloads, 14 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI