Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding

Zeyu Xiong (1), Daizong Liu (2), Pan Zhou (1)
(1) The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology
(2) Wangxuan Institute of Computer Technology, Peking University

arXiv:2207.00744 · cs.CV · July 5, 2022

TL;DR

This paper introduces GKCMN, an anchor-free, Gaussian kernel-based network for spatio-temporal video grounding that models both spatial and temporal relations among video frames.

Contribution

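The paper proposes what the authors describe as the first anchor-free framework for STVG. It uses learned Gaussian kernel-based heatmaps to locate the query-related object in each frame, and a mixed serial and parallel connection network to model spatial and temporal relations among frames.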

Findings

01. Outperforms previous methods on the VidSTG dataset
02. Models temporal relations among video frames
03. Uses Gaussian heatmaps for object localization

Abstract

Spatial-Temporal Video Grounding (STVG) is a challenging task that aims to localize the spatio-temporal tube of the object of interest according to a natural language query. Most previous works rely heavily on anchor boxes extracted by Faster R-CNN and treat the video as a series of individual frames, and therefore lack temporal modeling. In this paper, we instead propose the first anchor-free framework for STVG, called Gaussian Kernel-based Cross Modal Network (GKCMN). Specifically, we use learned Gaussian kernel-based heatmaps of each video frame to locate the query-related object. We further develop a mixed serial and parallel connection network that leverages both spatial and temporal relations among frames for better grounding. Experimental results on the VidSTG dataset demonstrate the effectiveness of the proposed GKCMN.
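
To make the idea of a Gaussian kernel-based heatmap concrete, the following is a minimal sketch of how a single-frame heatmap centered on an object can be built and then decoded back to a location. It is not the paper's implementation: in GKCMN the heatmaps are learned by the network, whereas here the heatmap is computed directly from a known center. The function names `gaussian_heatmap` and `argmax_2d`, the grid size, the center, and `sigma` are all hypothetical choices made for illustration.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Return a height x width grid whose values peak at 1.0 on `center`.

    `center` is (row, col) and `sigma` controls the spread. Both are
    illustrative parameters, not values taken from the paper.
    """
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def argmax_2d(heatmap):
    """Return the (row, col) position of the largest value in a 2D list."""
    best = max(
        (
            (value, (y, x))
            for y, row in enumerate(heatmap)
            for x, value in enumerate(row)
        ),
        key=lambda item: item[0],
    )
    return best[1]


# Build a heatmap for one frame and recover the object center from its peak.
heatmap = gaussian_heatmap(height=8, width=10, center=(3, 6), sigma=1.5)
peak = argmax_2d(heatmap)
```

In this sketch the peak of the heatmap coincides with the chosen center, which is the property an anchor-free detector relies on: the object location is read from the heatmap itself rather than selected from a set of precomputed anchor boxes.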

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: RoIPool · Convolution · Region Proposal Network · Softmax · Faster R-CNN