OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Ling Zhang, Beng Chin Ooi, Yingda Xia

TL;DR
OmniCT introduces a unified large vision-language model for CT analysis that combines slice and volume understanding, improving clinical interpretation by enhancing spatial consistency and semantic alignment across diverse tasks.
Contribution
The paper presents OmniCT, a novel model that unifies slice and volumetric CT analysis through spatial and semantic enhancements, addressing limitations of existing LVLMs.
Findings
Outperforms existing methods across multiple clinical tasks
Achieves high spatial and semantic consistency in CT understanding
Introduces the largest slice-volume CT dataset and benchmark
Abstract
Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a…
Peer Reviews
Decision: ICLR 2026 Poster
The paper is strongly motivated and addresses a real, clinically relevant gap in current medical LVLMs: the fragmentation between 2D slice-based and 3D volume-based CT understanding. 1. Strong and clinically grounded motivation. 2. Unified slice–volume modeling framework: It proposes a coherent architectural design (SCE + OSE + MoE projection) that enables shared representation learning between slice-driven and volume-driven inputs. Its unified slice–volume framework is conceptually coheren
1. Conceptual novelty and differentiation: The “unified slice-volume” design is valuable but not fundamentally new; similar goals have been pursued by Med-2E3, Med3DInsight, and hybrid 2D/3D LVLMs. Can the authors more clearly articulate what OmniCT does differently (e.g., tri-axial positional encoding vs. cross-slice attention in Med-2E3)? Why is tri-axial positional encoding important, and is there an ablation for it? 2. Architectural complexity vs. simplicity: In the MoE routing, onl
The paper outperforms its baselines across many benchmarks, pushing the state of the art in medical image understanding. Moreover, the authors introduce a way to leverage both 2D and 3D medical images.
My main concern with the paper is two-fold. First and most importantly, the motivation of the paper is unclear to me. The authors' motivation originates largely from the fact that both 2D and 3D CT medical image datasets exist; however, by default all CT images are 3D. Consequently, I don't know (and the authors don't motivate well) why 2D slices need to be integrated. How would one obtain the 2D slice selection? If a professional is in the loop, why would I need the LVLM? This question puts
- The proposed SCE/OSE design directly addresses the gap between slice-driven detail sensitivity and volume-driven spatial reasoning, which are squarely motivated by clinical reading patterns.
- OmniCT shows consistent improvements over some models across diverse 2D and 3D CT benchmarks, with well-documented ablations supporting the claims.
- Table 1 (SCE/OSE ablation) and studies on mixed-data training, encoder choices (2D vs 3D), and organ/task-level heatmaps add useful insight and support t
- While MedEval-CT is positioned as “largest” (~1.7M) and “holistic,” the data sources, licensing, and deduplication across training vs evaluation (and vs existing public benchmarks) are not detailed enough to rule out contamination/leakage.
- The pipeline leans on large models (Qwen2.5-VL-72B, Qwen3-237B-A3B) for selection/mapping/refinement. This raises bias propagation questions and potential circularity if similar model families are used in training/eval.
- The 2D-encoder-dominant finding
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
