DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions
Gorkem Can Ates, Yu Xin, Kuang Gong, and Wei Shao

TL;DR
DCFormer is a computationally efficient 3D image encoder for vision-language modeling. It factorizes 3D convolutions into parallel 1D convolutions, which lowers computational cost while improving performance on medical imaging tasks and makes 3D medical VLMs more scalable and deployable.
Contribution
The paper proposes a 3D image encoder built on decomposed convolutions, which reduces computational cost while preserving spatial information, and integrates this encoder into a CLIP-based vision-language framework for medical imaging.
Findings
Outperforms state-of-the-art 3D vision encoders in pathology detection
Achieves superior results in image-text retrieval tasks
Demonstrates scalability and clinical applicability of 3D VLMs
Abstract
Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology…
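To make the cost argument in the abstract concrete, the following back-of-envelope sketch compares kernel parameter counts for a full k×k×k convolution against three parallel 1D convolutions along depth, height, and width. It assumes depthwise (per-channel) kernels and ignores biases; both are illustrative assumptions, and the function names are hypothetical rather than taken from the DCFormer code.

```python
# Rough parameter-count comparison (illustrative sketch, not the authors' code).
# Assumptions: depthwise (per-channel) kernels, no bias terms.

def dense_conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Standard 3D convolution: every output channel sees every input channel."""
    return c_in * c_out * k ** 3


def depthwise_conv3d_params(channels: int, k: int) -> int:
    """Depthwise 3D convolution: one k x k x k kernel per channel."""
    return channels * k ** 3


def decomposed_depthwise_params(channels: int, k: int) -> int:
    """Three parallel 1D depthwise kernels per channel:
    k x 1 x 1 (depth), 1 x k x 1 (height), 1 x 1 x k (width)."""
    return 3 * channels * k


if __name__ == "__main__":
    channels = 64  # hypothetical channel width
    for k in (3, 7, 11):
        full = depthwise_conv3d_params(channels, k)
        split = decomposed_depthwise_params(channels, k)
        print(f"k={k:2d}  depthwise 3D: {full:7d}  decomposed: {split:5d}  "
              f"ratio: {full / split:.1f}x")
```

Under these assumptions the per-channel kernel cost grows as k³ for the full 3D kernel but only as 3k for the decomposed form, so the gap widens as kernel size increases. The multiply-accumulate count per output voxel follows the same scaling.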
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: PoolFormer · ConvNeXt · ALIGN
