ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation
Chen Liang, Yu Wu, Yawei Luo, Yi Yang

TL;DR
ClawCraneNet introduces a top-down, object-level relation approach for text-based video segmentation, improving accuracy and explainability by modeling relations among candidate objects.
Contribution
The paper proposes a novel top-down method for text-based video segmentation that models object-level relations (positional, semantic, and temporal) among candidate objects.
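To make the top-down idea concrete, here is a minimal sketch of selecting one candidate object by relating candidates to each other and then matching them against the text. It is not the authors' architecture: the function names (`relate`, `select_candidate`), the toy feature vectors, and the use of plain dot-product attention with cosine scoring are all assumptions made for illustration, and only a generic feature-level relation is modeled, not the positional or temporal relations the paper describes.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relate(features):
    """Hypothetical relation step: add to each candidate an
    attention-weighted mix of all candidate features."""
    related = []
    for f in features:
        weights = softmax([dot(f, g) for g in features])
        context = [
            sum(w * g[d] for w, g in zip(weights, features))
            for d in range(len(f))
        ]
        related.append([x + c for x, c in zip(f, context)])
    return related


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def select_candidate(features, text_embedding):
    """Score relation-aware candidates against the text and
    return (index of best candidate, all scores)."""
    scores = [cosine(f, text_embedding) for f in relate(features)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy inputs (assumed): three candidate objects and one text embedding.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
text = [0.0, 1.0]
best, scores = select_candidate(candidates, text)
```

In a real system the candidate features would come from an object detector or tracker and the text embedding from a language encoder; the sketch only shows the order of operations (relate candidates first, then choose the one that best matches the description).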
Findings
Outperforms state-of-the-art methods on the A2D Sentences and J-HMDB Sentences datasets.
Produces more explainable segmentation results.
Models object relations at multiple levels (positional, semantic, and temporal) to better ground the referring expression.
Abstract
Text-based video segmentation is a challenging task that segments the objects in a video that are referred to by natural language. It requires both semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which conducts vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language referring expressions. In fact, people usually describe a target object using its relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach by imitating how we humans segment an object with the language…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
