ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Chen Liang; Yu Wu; Yawei Luo; Yi Yang

arXiv:2103.10702 · cs.CV · January 22, 2024 · 20 cites

Open Access

TL;DR

ClawCraneNet introduces a top-down, object-level relation approach for text-based video segmentation, improving accuracy and explainability by modeling relations among candidate objects.

Contribution

The paper proposes a novel top-down method that models three kinds of object-level relations (positional, semantic, and temporal) to improve text-based video segmentation.
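To make the idea of an object-level positional relation concrete, here is a minimal toy sketch in Python. It is not the paper's model: the `Candidate` class, the `relation_score` and `pick_referred` helpers, the hand-written relation names, and the example boxes are all hypothetical, and the semantic and temporal relations are left out entirely. It only illustrates the top-down pattern of first obtaining whole-object candidates and then choosing among them by their relation to another object.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Candidate:
    """A whole-object candidate (hypothetical; e.g. from an off-the-shelf detector)."""
    name: str
    box: Box


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relation_score(subject: Candidate, anchor: Candidate, relation: str) -> float:
    """Score how well `subject` satisfies a positional relation to `anchor`.

    Assumption: only two hand-written horizontal relations are supported;
    a learned model would replace this function.
    """
    sx, _ = center(subject.box)
    ax, _ = center(anchor.box)
    if relation == "left_of":
        return ax - sx  # larger when the subject lies further left of the anchor
    if relation == "right_of":
        return sx - ax
    raise ValueError(f"unsupported relation: {relation}")


def pick_referred(candidates: List[Candidate], target: str,
                  relation: str, anchor_name: str) -> Candidate:
    """Pick the `target` candidate that best satisfies `relation` to any anchor."""
    subjects = [c for c in candidates if c.name == target]
    anchors = [c for c in candidates if c.name == anchor_name]
    if not subjects or not anchors:
        raise ValueError("no matching subject or anchor candidates")
    return max(
        subjects,
        key=lambda s: max(relation_score(s, a, relation) for a in anchors),
    )


# Toy scene for a query like "the dog to the left of the man".
candidates = [
    Candidate("dog", (10, 50, 60, 100)),
    Candidate("man", (100, 20, 160, 120)),
    Candidate("dog", (200, 50, 250, 100)),
]
chosen = pick_referred(candidates, "dog", "left_of", "man")
print(chosen.box)
```

The point of the sketch is that the decision is made over complete objects rather than over local receptive fields, which is what lets a relation such as "left of" be evaluated at all.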

Findings

01

Outperforms prior state-of-the-art methods on the A2D Sentences and J-HMDB Sentences datasets.

02

Achieves more explainable segmentation results.

03

Effectively models multi-level object relations for better understanding.

Abstract

Text-based video segmentation is a challenging task that segments the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language and referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach by imitating how we humans segment an object with the language…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling