Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability

Rezaul Karim; Richard P. Wildes

arXiv:2310.12296 · cs.CV · October 20, 2023 · 1 citation

TL;DR

This survey reviews recent transformer-based approaches to video segmentation, with emphasis on model components, interpretability methods, and the understanding of temporal dynamics. It identifies gaps in prior reviews and suggests future research directions.

Contribution

It provides a component-wise analysis of transformer-based video segmentation models and of interpretability methods, addressing gaps left by previous surveys, which focused mainly on ConvNet models for a subset of video segmentation tasks or on transformers for classification tasks.

Findings

01. Thorough categorization of video segmentation tasks and datasets.
02. Detailed review of transformer-based models for various segmentation tasks.
03. Discussion of interpretability techniques specific to video transformers.

Abstract

Video segmentation encompasses a wide range of categories of problem formulation, e.g., object, scene, actor-action and multimodal video segmentation, for delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area shifted from concentrating on ConvNet-based to transformer-based models. In addition, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by the growing interest in basic scientific understanding, model diagnostics and societal implications of real-world deployment. Previous surveys mainly focused on ConvNet models on a subset of video segmentation tasks or transformers for classification tasks. Moreover, component-wise discussion of transformer-based video segmentation models has not yet received due focus. In addition, previous reviews of interpretability methods…

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition