Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation
Chengxi Zeng, Xinyu Yang, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt

TL;DR
Video-TransUNet is a novel deep learning architecture that combines CNNs, transformers, and temporal feature blending to improve instance segmentation accuracy in medical CT videos, specifically for swallowing studies.
Contribution
The paper introduces Video-TransUNet, integrating temporal feature blending into TransUNet for enhanced segmentation in medical videos, outperforming existing methods.
Findings
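Achieves a Dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels on the VFSS2022 dataset.
Significantly outperforms other state-of-the-art systems on bolus and pharynx/larynx segmentation in VFSS sequences.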
Provides open-source code and annotations for reproducibility.
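For readers unfamiliar with the headline metric, the following is a minimal sketch of the Dice coefficient between a predicted and a ground-truth binary mask. It is not taken from the authors' released code; the function name `dice_coefficient`, the nested-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal shape.

    Masks are given as nested lists (rows of 0/1 values); this layout is an
    illustrative assumption, not the format used by the paper's code.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels that are foreground in both masks
    pred_sum = 0      # foreground pixels in the prediction
    target_sum = 0    # foreground pixels in the ground truth
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p:
                pred_sum += 1
            if t:
                target_sum += 1
            if p and t:
                intersection += 1

    if pred_sum + target_sum == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy example: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no foreground pixel, so the reported 0.8796 indicates substantial but imperfect overlap.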
Abstract
We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that tracking the pharyngeal…
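The abstract also reports an average surface distance in pixels. The excerpt does not say which variant of the metric is used, so the sketch below assumes one common choice: a symmetric average over the boundary pixels of both masks, with boundaries defined by 4-connectivity. The function names and the nested-list mask layout are hypothetical and chosen only for illustration.

```python
import math


def boundary_pixels(mask):
    """Return foreground pixels that touch background or the image border.

    Uses 4-connectivity; this boundary definition is an assumption.
    """
    height, width = len(mask), len(mask[0])
    points = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = rr < 0 or rr >= height or cc < 0 or cc >= width
                if outside or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def average_surface_distance(pred, target):
    """Symmetric mean of nearest-boundary distances, in pixels."""
    a = boundary_pixels(pred)
    b = boundary_pixels(target)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # For each boundary pixel in src, distance to the closest one in dst.
        return [
            min(math.hypot(r1 - r2, c1 - c2) for r2, c2 in dst)
            for r1, c1 in src
        ]

    distances = directed(a, b) + directed(b, a)
    return sum(distances) / len(distances)


# Toy example: a single-pixel object shifted by one column.
pred = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
target = [[0, 0, 0, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 0]]
asd = average_surface_distance(pred, target)
```

This brute-force version compares every pair of boundary pixels, which is fine for small masks; a distance transform would be the usual choice for full-resolution frames.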
Taxonomy
Topics: Dysphagia Assessment and Management · Tracheal and airway disorders · Voice and Speech Disorders
Methods: Attention Is All You Need · Linear Layer · 1x1 Convolution · Batch Normalization · Label Smoothing · Bottleneck Residual Block · Position-Wise Feed-Forward Layer · Residual Connection · Adam
