DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

Gorkem Can Ates, Yu Xin, Kuang Gong, and Wei Shao

arXiv:2502.05091 · cs.CV · April 28, 2025 · 2 cites

TL;DR

DCFormer introduces a computationally efficient 3D vision-language model that uses decomposed convolutions to improve performance on medical imaging tasks, enabling scalable and deployable 3D medical VLMs.

Contribution

The paper proposes a novel 3D image encoder with decomposed convolutions that reduces computational cost while maintaining spatial information, integrated into a CLIP-based framework for medical imaging.

Findings

1. Outperforms state-of-the-art 3D vision encoders in pathology detection
2. Achieves superior results in image-text retrieval tasks
3. Demonstrates scalability and clinical applicability of 3D VLMs

Abstract

Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology…
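The cost argument in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the paper: it assumes depthwise convolutions (one filter per channel), ignores bias terms, and uses the hypothetical helper names `conv3d_depthwise_params` and `decomposed_params`. Under those assumptions, a full k×k×k kernel needs k³ weights per channel, while three parallel 1D kernels of shapes k×1×1, 1×k×1, and 1×1×k need only 3k. The channel count and kernel sizes are illustrative values, not the paper's configuration.

```python
def conv3d_depthwise_params(channels: int, k: int) -> int:
    """Weights in a depthwise k x k x k convolution (assumption: no bias)."""
    return channels * k ** 3


def decomposed_params(channels: int, k: int) -> int:
    """Weights in three parallel depthwise 1D convolutions, one each along
    depth (k x 1 x 1), height (1 x k x 1), and width (1 x 1 x k).
    Assumption: no bias."""
    return channels * 3 * k


if __name__ == "__main__":
    channels = 64  # illustrative value, not from the paper
    for k in (3, 7, 13):  # illustrative kernel sizes
        full = conv3d_depthwise_params(channels, k)
        dec = decomposed_params(channels, k)
        print(f"k={k}: full={full}, decomposed={dec}, ratio={full / dec:.1f}x")
```

The ratio between the two counts is k²/3, so the saving grows quadratically with kernel size. This matches the abstract's point that full 3D convolutions become expensive as kernels get larger.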

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: PoolFormer · ConvNeXt · ALIGN