Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, Yi Yang

TL;DR
This paper introduces Locater, a novel Transformer-based model with a memory mechanism for language-guided video segmentation, effectively capturing long-term context and achieving state-of-the-art results on multiple datasets.
Contribution
The paper proposes a local-global context aware Transformer with a finite memory for efficient, accurate language-guided video segmentation, and introduces a new challenging dataset A2D-S+.
Findings
Locater outperforms previous methods on three LVS datasets.
Locater won 1st place in a major video object segmentation challenge.
The model processes videos with linear time complexity and constant memory size.
Abstract
We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components -- one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for…
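The two-part memory described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`attend`, `segment_video`), the slot counts (`n_global`, `n_local`), the use of mean pooling as a frame summary, and the zero threshold on the mask logits are all assumptions made for illustration. It only shows how a fixed-size global memory plus a bounded local memory lets each frame receive its own query vector while the stored context stays the same size regardless of video length.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, keys):
    """Single-head dot-product attention; the keys double as values."""
    weights = softmax(query @ keys.T / np.sqrt(keys.shape[-1]))
    return weights @ keys


def segment_video(frames, words, n_global=4, n_local=2):
    """frames: (T, N, D) pixel features per frame; words: (L, D) word features.

    Returns boolean masks of shape (T, N) and the peak number of memory slots.
    """
    T = frames.shape[0]
    # Global memory (assumed form): summaries of a fixed number of frames
    # sampled evenly across the video, kept unchanged during processing.
    idx = np.linspace(0, T - 1, num=min(n_global, T)).astype(int)
    global_mem = frames[idx].mean(axis=1)
    # Local memory (assumed form): bounded FIFO of recent segmentation summaries.
    local_mem = deque(maxlen=n_local)

    masks, peak_slots = [], 0
    for t in range(T):
        pixels = frames[t]
        frame_summary = pixels.mean(axis=0, keepdims=True)
        # Combine global content, local history, and the current frame.
        context = np.vstack([global_mem, *local_mem, frame_summary]).mean(axis=0)
        # Frame-specific query: the expression re-weighted by the context.
        query = attend(context[None, :], words)[0]
        # Query the frame: one logit per pixel, thresholded into a mask.
        mask = pixels @ query > 0
        masks.append(mask)
        # Store a summary of the segmented region as local history.
        region = pixels[mask] if mask.any() else pixels
        local_mem.append(region.mean(axis=0, keepdims=True))
        peak_slots = max(peak_slots, len(global_mem) + len(local_mem))
    return np.stack(masks), peak_slots


rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
short_masks, short_slots = segment_video(rng.normal(size=(6, 10, 8)), words)
long_masks, long_slots = segment_video(rng.normal(size=(50, 10, 8)), words)
```

In this sketch each frame is visited once and the memory never exceeds `n_global + n_local` slots, so `short_slots` and `long_slots` are equal even though the second video is much longer.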
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dense Connections · Attentive Walk-Aggregating Graph Neural Network
