Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
Rezaul Karim, Richard P. Wildes

TL;DR
This survey reviews recent transformer-based approaches to video segmentation, covering model components, interpretability methods for transformers and video temporal dynamics, gaps left by prior reviews, and directions for future research.
Contribution
It provides a comprehensive, component-wise analysis of transformer-based video segmentation models and interpretability methods, addressing gaps in previous surveys focused mainly on classification tasks.
Findings
Thorough categorization of video segmentation tasks and datasets.
Detailed review of transformer-based models for various segmentation tasks.
Discussion of interpretability techniques specific to video transformers.
Abstract
Video segmentation encompasses a wide range of problem formulations, e.g., object, scene, actor-action and multimodal video segmentation, for delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area have shifted from ConvNet-based to transformer-based models. In addition, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by growing interest in basic scientific understanding, model diagnostics and the societal implications of real-world deployment. Previous surveys mainly focused on ConvNet models for a subset of video segmentation tasks or on transformers for classification tasks. Moreover, component-wise discussion of transformer-based video segmentation models has not yet received due focus. In addition, previous reviews of interpretability methods…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition
