Contrastive Video-Language Segmentation
Chen Liang, Yawei Luo, Yu Wu, Yi Yang

TL;DR
This paper introduces a contrastive learning approach for video-language segmentation that explicitly aligns referred objects with language descriptions, improving the distinction of semantically similar objects in videos.
Contribution
It proposes a novel contrastive learning framework with hard instance mining strategies to enhance object-language alignment in video segmentation.
Findings
Achieves state-of-the-art results on the A2D Sentences and J-HMDB Sentences benchmarks.
Demonstrates improved differentiation between semantically similar objects.
Qualitative results show more accurate object distinction.
Abstract
We focus on the problem of segmenting a certain object referred to by a natural language sentence in video content, at the core of formulating a pinpoint vision-language relation. While existing attempts mainly construct such a relation in an implicit way, i.e., grid-level multi-modal feature fusion, it has been proven problematic to distinguish semantically similar objects under this paradigm. In this work, we propose to intertwine the visual and linguistic modalities in an explicit way via the contrastive learning objective, which directly aligns the referred object with the language description and separates the unreferred content across frames. Moreover, to remedy the degradation problem, we present two complementary hard instance mining strategies, i.e., Language-relevant Channel Filter and Relative Hard Instance Construction. They encourage the network to exclude…
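As a rough illustration of what "aligning the referred object and the language description while separating the unreferred content" can look like as a loss, here is a minimal sketch using a generic InfoNCE-style objective. This is an assumption for illustration only: the abstract does not give the paper's exact loss, and the function names (`cosine`, `info_nce`), the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(language, referred, unreferred, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact form).

    language   : sentence embedding (hypothetical)
    referred   : embedding of the referred object, the positive
    unreferred : embeddings of unreferred content, the negatives
    """
    pos = cosine(language, referred) / temperature
    negs = [cosine(language, n) / temperature for n in unreferred]
    logits = [pos] + negs
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    # Equals -log(exp(pos) / sum(exp(logits))).
    return log_denom - pos

# Toy 2-D embeddings, chosen by hand for illustration.
language = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.2]]

# Positive that points the same way as the sentence embedding.
loss_aligned = info_nce(language, [0.9, 0.1], negatives)
# Positive that is orthogonal to the sentence embedding.
loss_misaligned = info_nce(language, [0.0, 1.0], negatives)

print(loss_aligned < loss_misaligned)
```

In this sketch the loss falls as the referred-object embedding moves toward the sentence embedding and away from the negatives, which is the behavior the explicit alignment objective is meant to encourage. The hard instance mining strategies named in the abstract are not modeled here.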
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
