Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Yuanhao Xiong; Long Zhao; Boqing Gong; Ming-Hsuan Yang; Florian Schroff; Ting Liu; Cho-Jui Hsieh; Liangzhe Yuan

arXiv:2303.16341·cs.CV·September 10, 2024·1 cites

TL;DR

This paper introduces S-ViLM, a novel video-language model that enhances fine-grained spatial and temporal understanding through inter-clip spatial grounding and intra-clip temporal grouping, improving performance on multiple downstream tasks.

Contribution

The paper proposes a new framework, S-ViLM, that captures region-object correspondences and scene changes by exploiting intrinsic structures of video and text modalities.

Findings

01 Outperforms existing methods on four downstream tasks

02 Achieves significant improvements in text-video retrieval and video question answering

03 Enhances temporal localization and semantic reasoning capabilities

Abstract

Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is important to downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to be capable of capturing region-object correspondences and recognizing scene changes in a video clip, reflecting spatial and temporal granularity, respectively. To strengthen the model's understanding of such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, by exploiting the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features,…
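
For readers unfamiliar with the "instance-level alignment via global contrastive learning" that the abstract attributes to existing methods, the following is a minimal sketch of a symmetric InfoNCE loss over paired clip and caption embeddings. It illustrates that general baseline only, not S-ViLM's spatial grounding or temporal grouping objectives. The function names, the pure-Python list representation, and the temperature of 0.07 are assumptions chosen for illustration and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where video_embs[i] pairs with text_embs[i].

    Hypothetical helper: the temperature value is an assumed default.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity between clip i and caption j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(logits):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    # Transpose to obtain the text-to-video direction.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch of two clip-caption pairs with 2-D embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = symmetric_info_nce(video, text)
mismatched_loss = symmetric_info_nce(video, text[::-1])
```

In this toy batch the loss is lower when each clip is paired with its own caption than when the captions are swapped. Because the loss compares one pooled embedding per clip against one per caption, it carries no explicit signal about regions or scene changes, which is the gap the abstract points to.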

Videos

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning