STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Weihong Zhong; Mao Zheng; Duyu Tang; Xuan Luo; Heng Gong; Xiaocheng Feng; Bing Qin

arXiv:2302.09736 · cs.CV · November 10, 2023 · 1 cites

Open Access

TL;DR

STOA-VLP introduces a fine-grained pre-training framework for video-language models that jointly models object trajectories and actions across space and time, improving downstream task performance.

Contribution

STOA-VLP is the first to incorporate fine-grained object and action information into video-language pre-training through auxiliary tasks.

Findings

01

3.7 ROUGE-L improvement on MSR-VTT captioning

02

2.9% accuracy increase on MSVD VQA

03

Effective modeling of object trajectories and actions

Abstract

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the use of fine-grained information during the pre-training stage is not well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is the dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques