Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

Fan Yu; Zhixiang Zhao; Yuchen Wang; Yi Xu; Tongwei Ren; Gangshan Wu

arXiv:2207.04201·cs.MM·August 16, 2022

TL;DR

This paper presents a human-centric spatio-temporal video grounding method that combines TubeDETR and Mutual Matching Network (MMN) to localize a person in a video from a text description. The method placed third in the HC-STVG track of the 4th PIC challenge.

Contribution

The paper introduces a combination of TubeDETR and Mutual Matching Network (MMN) for spatio-temporal grounding of persons in videos, taking the spatial localization from MMN and the temporal localization from TubeDETR.

Findings

01 Achieved third place in the 4th PIC challenge.

02 Effectively combines spatial and temporal localization.

03 Improved accuracy in person grounding in videos.

Abstract

In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution is built on TubeDETR and Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the starting time, the ending time, and the tube of the target person. MMN detects persons in images, links them into tubes, extracts features of the person tubes and the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Finally, our solution refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieves third place.
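The final combination step can be illustrated with a minimal sketch. The abstract does not specify the exact fusion rule, so the code below assumes one plausible reading: take the person tube with the highest text similarity (the MMN side) and keep only its boxes inside the predicted start and end frames (the TubeDETR side). The names `select_best_tube`, `clip_tube`, and `combine`, the frame-indexed dictionary layout, and the inclusive frame range are all hypothetical and are not taken from the report.

```python
from typing import Dict, List, Tuple

# Assumed layout: a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def select_best_tube(tubes: List[Tube], similarities: List[float]) -> Tube:
    """Return the tube whose text similarity score is highest.

    `similarities[i]` is assumed to be the matching score between
    `tubes[i]` and the text description (hypothetical interface).
    """
    if not tubes or len(tubes) != len(similarities):
        raise ValueError("need one similarity score per tube, and at least one tube")
    best_index = max(range(len(tubes)), key=lambda i: similarities[i])
    return tubes[best_index]


def clip_tube(tube: Tube, start_frame: int, end_frame: int) -> Tube:
    """Keep only the boxes whose frame index lies in [start_frame, end_frame]."""
    return {f: tube[f] for f in sorted(tube) if start_frame <= f <= end_frame}


def combine(tubes: List[Tube], similarities: List[float],
            start_frame: int, end_frame: int) -> Tube:
    """Spatial choice from the similarity scores, temporal extent from the interval."""
    return clip_tube(select_best_tube(tubes, similarities), start_frame, end_frame)


# Toy example: two candidate tubes over frames 0..3.
tube_a = {f: (0.0, 0.0, 10.0, 10.0) for f in range(4)}
tube_b = {f: (5.0, 5.0, 20.0, 20.0) for f in range(4)}
result = combine([tube_a, tube_b], [0.2, 0.9], start_frame=1, end_frame=2)
```

In this toy example, `result` holds the boxes of `tube_b` for frames 1 and 2 only. A real system would also need to handle frames inside the interval where the chosen tube has no detection, which this sketch simply leaves out.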

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications