T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
Pranjal Khadka

TL;DR
This paper introduces T-Gated Adapter, a lightweight temporal transformer-based module that enhances vision-language models for 3D medical image segmentation by incorporating adjacent-slice context, improving accuracy and cross-domain robustness.
Contribution
It proposes a novel temporal adapter with a transformer, spatial refinement, and adaptive gating to leverage 3D context in vision-language models for medical segmentation.
Findings
Achieves a mean Dice of 0.704 on FLARE22, a +0.206 improvement over the baseline.
Zero-shot Dice improves by +0.210 on BTCV and +0.230 on AMOS22.
Cross-modality evaluation shows better generalization, with a Dice of 0.366 on MRI, outperforming supervised 3D models.
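For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of the standard Dice coefficient for one binary mask, Dice = 2|P ∩ T| / (|P| + |T|). The function name `dice`, the flat 0/1 list representation, and the choice to score two empty masks as 1.0 are illustrative assumptions, not the paper's evaluation code.

```python
def dice(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    # Count voxels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One overlapping voxel, two predicted, one in the ground truth: 2*1 / (2+1).
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
print(round(score, 3))  # 0.667
```

A mean Dice, as reported above, is typically the average of this per-structure score over all segmented organs.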
Abstract
Medical image segmentation traditionally relies on fully supervised 3D architectures that demand large amounts of dense, voxel-level annotation from clinical experts, a prohibitively expensive process. Vision-Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to the 2D slices of a 3D scan, these models often produce noisy, anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing…
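The idea of token-level attention across a slice window followed by a gate can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' architecture: it uses single-head attention with no learned projections, omits the spatial context block entirely, and assumes the gate blends the center slice's own tokens with the cross-slice context (the abstract is truncated before saying what the gate balances). The names `temporal_gated_mix`, `gate_w`, and `gate_b` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_gated_mix(window, gate_w, gate_b):
    """Blend the center slice's tokens with context from adjacent slices.

    window: (S, N, D) visual tokens for S adjacent slices, N tokens each.
    gate_w: (2*D, 1) gate weights; gate_b: scalar gate bias (both hypothetical).
    Returns the (N, D) updated center-slice tokens and the (N, 1) gate values.
    """
    num_slices, _, dim = window.shape
    center = window[num_slices // 2]  # (N, D)

    # Each center token attends to the same token position in every slice.
    scores = np.einsum("nd,snd->ns", center, window) / np.sqrt(dim)
    attn = softmax(scores, axis=-1)  # (N, S)
    context = np.einsum("ns,snd->nd", attn, window)  # (N, D)

    # Assumed gate: a sigmoid over the concatenated center and context tokens.
    gate_in = np.concatenate([center, context], axis=-1)  # (N, 2D)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ gate_w + gate_b)))  # (N, 1)

    return (1.0 - gate) * center + gate * context, gate


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4, 8))  # 5-slice window, 4 tokens, 8 dims
out, gate = temporal_gated_mix(tokens, np.zeros((16, 1)), 0.0)
print(out.shape, gate.shape)  # (4, 8) (4, 1)
```

With a strongly negative gate bias the output falls back to the unmodified center-slice tokens, which is one reason a gated design can be added to a pretrained model without disturbing it at initialization.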
