Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang, and J. Alison Noble

arXiv:2605.15997 · cs.CV · May 18, 2026

TL;DR

This paper introduces a unified framework for CT image analysis that combines language-guided visual reasoning with progressive localization, improving segmentation and detection accuracy while providing interpretability.

Contribution

It proposes a novel autoregressive model integrating language and vision for comprehensive CT interpretation, with a new dataset and a coarse-to-fine attention mechanism.

Findings

01. Achieved up to a 1.0% Dice improvement on the BTCV benchmark.

02. Achieved up to a 1.7% Dice improvement on the MosMed+ benchmark.

03. Produced appearance reasoning outputs in addition to segmentation and detection.
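The segmentation findings are reported in Dice, the standard overlap measure between a predicted mask and a reference mask: twice the size of their intersection divided by the sum of their sizes. The sketch below shows that formula on flat binary masks. The function name `dice_score` and the empty-mask convention are assumptions made for illustration; the paper's own evaluation code is not described in this summary.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks.

    `pred` and `target` are equal-length sequences of 0/1 voxel labels.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")

    # Voxels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(prediction, reference))
```

Reported Dice values are usually averaged over organs or cases, so a gain of about one point on a benchmark is an average over many such per-mask scores.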

Abstract

Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, and most methods lack explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflows that require multiple fine-grained analyses, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g.,…
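The task-routing idea in the abstract can be pictured as a dispatch step: when the language model emits a special token, the hidden state at that position is handed to the matching head. The sketch below is only a toy illustration of that control flow. The token strings `<SEG>` and `<DET>`, the function `route_tokens`, and both toy heads are hypothetical; the abstract does not give the paper's token names, head architectures, or output formats.

```python
# Hypothetical routing tokens; the paper's actual vocabulary is not given here.
SEG_TOKEN = "<SEG>"
DET_TOKEN = "<DET>"


def route_tokens(tokens, hidden_states, heads):
    """Send the hidden state at each routing token to its task head.

    tokens:        decoded token strings, one per position
    hidden_states: one vector (list of floats) per position
    heads:         dict mapping a routing token to a callable head
    Returns a list of (position, token, head_output) tuples.
    """
    outputs = []
    for position, (token, hidden) in enumerate(zip(tokens, hidden_states)):
        head = heads.get(token)
        if head is not None:  # ordinary text tokens are skipped
            outputs.append((position, token, head(hidden)))
    return outputs


def toy_seg_head(hidden):
    # Stand-in for a mask decoder: threshold each feature at zero.
    return [1 if value > 0 else 0 for value in hidden]


def toy_det_head(hidden):
    # Stand-in for a box regressor: report the feature range.
    return (min(hidden), max(hidden))


if __name__ == "__main__":
    tokens = ["The", DET_TOKEN, "liver", SEG_TOKEN]
    hidden_states = [[0.1, 0.2], [-1.0, 2.0], [0.0, 0.0], [0.5, -0.5]]
    heads = {SEG_TOKEN: toy_seg_head, DET_TOKEN: toy_det_head}
    for position, token, output in route_tokens(tokens, hidden_states, heads):
        print(position, token, output)
```

In a real system the heads would be learned modules that also consume image features; the point here is only that text generation and visual outputs share one token stream.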
