Vivim: a Video Vision Mamba for Medical Video Segmentation

Yijun Yang; Zhaohu Xing; Lequan Yu; Chunwang Huang; Huazhu Fu; Lei Zhu

arXiv:2401.14168·cs.CV·August 2, 2024·26 cites

TL;DR

Vivim introduces a novel Video Vision Mamba framework leveraging state space models for efficient long-term medical video segmentation, outperforming existing methods in accuracy and computational efficiency.

Contribution

The paper proposes Vivim, a new framework that combines state space models with a Temporal Mamba Block and boundary-aware constraints for improved medical video segmentation.

Findings

01. Outperforms existing methods in thyroid, breast, and colonoscopy video segmentation.

02. Effectively captures long-term spatiotemporal dependencies with reduced computational cost.

03. Demonstrates superior accuracy and efficiency on multiple medical video datasets.

Abstract

Medical video segmentation is gaining increasing attention in clinical practice because video frames provide redundant dynamic references. However, traditional convolutional neural networks have a limited receptive field, and transformer-based networks struggle to build long-term dependencies because of their computational complexity. This bottleneck poses a significant challenge when processing longer sequences in medical video analysis on available devices with limited memory. Recently, state space models (SSMs), popularized by Mamba, have shown impressive results in efficient long-sequence modeling, and they have advanced deep neural networks on many vision tasks by significantly expanding the receptive field. Unfortunately, vanilla SSMs fail to simultaneously capture causal temporal cues and preserve non-causal spatial information. To this end, this paper presents a…
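
The causal versus non-causal distinction mentioned in the abstract can be illustrated with a toy sketch. The code below is not Vivim's Temporal Mamba Block and is not taken from the official repository. It is a generic discrete, diagonal linear SSM written in plain Python, and the names `ssm_scan` and `bidirectional_scan`, along with the fixed parameters `a`, `b`, `c`, are hypothetical. A single forward scan is causal (each output sees only earlier inputs), while summing a forward and a backward scan is one common way to let every position see the whole sequence.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Run a discrete diagonal linear SSM over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
    y_t = sum(c * h_t)

    Cost is linear in the sequence length, and y_t depends only on
    x_0..x_t, so the scan is causal.
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x_t in x:
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def bidirectional_scan(x: Sequence[float], a: Sequence[float],
                       b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Sum a forward and a backward scan (an assumed, generic scheme).

    The backward pass runs the same SSM on the reversed sequence, so each
    output position also receives information from later inputs.
    """
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


if __name__ == "__main__":
    a, b, c = [0.5], [1.0], [1.0]  # toy one-dimensional state
    print(ssm_scan([1.0, 0.0, 0.0], a, b, c))
    print(bidirectional_scan([1.0, 0.0, 0.0], a, b, c))
```

In this sketch, changing the last input leaves the earlier outputs of `ssm_scan` unchanged but alters every output of `bidirectional_scan`, which is the kind of order-independent context that spatial positions within a frame typically need.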

Code & Models

Repositories

scott-yjyang/vivim (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings