Medical Referring Image Segmentation via Next-Token Mask Prediction
Xinyu Chen, Yiran Wang, Gaoyang Pang, Jiafu Hao, Chentao Yue, Luping Zhou, Yonghui Li

TL;DR
This paper introduces NTP-MRISeg, a unified autoregressive framework for segmenting medical images from natural language descriptions. It simplifies model design and improves performance through novel training strategies and pretrained tokenizers.
Contribution
The work reformulates MRIS as a next-token prediction task, eliminating complex fusion modules and enabling end-to-end training with pretrained multimodal tokenizers.
Findings
Achieves state-of-the-art results on the QaTa-COV19 and MosMedData+ datasets.
Streamlines MRIS model design with a unified autoregressive approach.
Enhances boundary sensitivity and reduces errors with novel training strategies.
Abstract
Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually rely on complex multimodal fusion designs or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and it supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation, such as exposure bias, long-tail…
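To make the "unified multimodal sequence" idea concrete, the following is a minimal sketch of how tokenized image, text, and mask IDs might be packed into one sequence and turned into next-token prediction pairs. Everything specific here is an assumption for illustration and is not taken from the paper: the vocabulary sizes, the special tokens (BOS, SEP_TXT, SEP_MASK, EOS), the image-text-mask ordering, the choice to supervise only mask positions, and the function names build_sequence and next_token_pairs.

```python
from typing import List, Tuple

# Hypothetical shared-vocabulary layout (assumed, not from the paper):
# special tokens first, then one contiguous ID range per modality.
BOS, SEP_TXT, SEP_MASK, EOS = 0, 1, 2, 3
NUM_SPECIAL = 4
IMAGE_VOCAB = 512   # assumed image-tokenizer codebook size
TEXT_VOCAB = 1000   # assumed text vocabulary size
MASK_VOCAB = 512    # assumed mask-tokenizer codebook size

IMAGE_OFFSET = NUM_SPECIAL
TEXT_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
MASK_OFFSET = TEXT_OFFSET + TEXT_VOCAB


def build_sequence(image_ids: List[int],
                   text_ids: List[int],
                   mask_ids: List[int]) -> List[int]:
    """Concatenate the three modalities into one token sequence."""
    return (
        [BOS]
        + [IMAGE_OFFSET + i for i in image_ids]
        + [SEP_TXT]
        + [TEXT_OFFSET + t for t in text_ids]
        + [SEP_MASK]
        + [MASK_OFFSET + m for m in mask_ids]
        + [EOS]
    )


def next_token_pairs(seq: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Shift the sequence by one position to get (inputs, targets).

    loss_mask marks the targets that would contribute to the loss.
    Here only mask tokens and the final EOS are supervised, which is
    one plausible choice and an assumption of this sketch.
    """
    inputs = seq[:-1]
    targets = seq[1:]
    loss_mask = [1 if (t >= MASK_OFFSET or t == EOS) else 0 for t in targets]
    return inputs, targets, loss_mask


# Toy example: 3 image tokens, 2 text tokens, 2 mask tokens.
seq = build_sequence(image_ids=[1, 2, 3], text_ids=[5, 7], mask_ids=[9, 9])
inputs, targets, loss_mask = next_token_pairs(seq)
```

In this sketch, a single decoder-only model would read inputs and be trained with cross-entropy against targets wherever loss_mask is 1, so the mask is produced token by token without a separate fusion module or segmentation head.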
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Retinal Imaging and Analysis
