Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang, and J. Alison Noble

TL;DR
This paper introduces a unified framework for CT image analysis that combines language-guided visual reasoning with progressive localization, improving segmentation and detection accuracy while providing interpretability.
Contribution
It proposes a novel autoregressive model integrating language and vision for comprehensive CT interpretation, with a new dataset and a coarse-to-fine attention mechanism.
Findings
Achieved up to a 1.0% Dice improvement on the BTCV benchmark.
Achieved up to a 1.7% Dice improvement on the MosMed+ benchmark.
Provided additional appearance reasoning outputs.
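For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of the standard Dice coefficient for two binary masks, Dice = 2|A∩B| / (|A| + |B|). The function name `dice_score`, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are illustrative assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy example: the prediction has 2 foreground voxels, the reference has 1,
# and they share 1, so Dice = 2*1 / (2 + 1).
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_score(pred, target))
```

In practice the metric is usually computed per organ or lesion class on 3D volumes and then averaged; the flat-list form here only illustrates the formula.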
Abstract
Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, and most methods lack explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflows that require several fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g.,…
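The task-routing idea described in the abstract can be illustrated with a rough sketch: special tokens in the generated sequence select which head consumes the hidden state at that position. Everything concrete here is a hypothetical stand-in rather than the authors' implementation: the token names `<DET>` and `<SEG>`, the helper `linear`, the function `route`, the plain linear heads, and the toy weights are all assumptions made for illustration.

```python
# Hypothetical routing tokens; the paper's actual token names are not given here.
DET_TOKEN = "<DET>"
SEG_TOKEN = "<SEG>"


def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector, without bias."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def route(tokens, hidden_states, det_weights, seg_weights):
    """Send the hidden state at each routing token to the matching head.

    tokens:        generated token strings
    hidden_states: one hidden vector per token (same length as tokens)
    Returns a list of (task_name, head_output) pairs in generation order.
    """
    outputs = []
    for token, hidden in zip(tokens, hidden_states):
        if token == DET_TOKEN:
            # Assumed detection head: maps the hidden state to 4 box values.
            outputs.append(("detection", linear(det_weights, hidden)))
        elif token == SEG_TOKEN:
            # Assumed segmentation head: maps the hidden state to a mask query.
            outputs.append(("segmentation", linear(seg_weights, hidden)))
    return outputs


# Toy sequence with 2-dimensional hidden states (illustrative values only).
tokens = ["The", "liver", DET_TOKEN, SEG_TOKEN]
hidden_states = [[0.1, 0.2], [0.3, 0.4], [1.0, 2.0], [0.5, -1.0]]
det_weights = [[1, 0], [0, 1], [1, 1], [1, -1]]  # 4 outputs: box parameters
seg_weights = [[2, 0], [0, 2], [1, 1]]           # 3 outputs: mask query

results = route(tokens, hidden_states, det_weights, seg_weights)
for task, output in results:
    print(task, output)
```

Ordinary text tokens pass through untouched, so a single autoregressive pass can interleave language with detection and segmentation outputs; a real system would replace the linear maps with learned decoder heads.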
