2rd Place Solutions in the HC-STVG track of Person in Context Challenge 2021

Yi Yu; Xinying Wang; Wei Hu; Xun Luo; Cheng Li

arXiv:2106.07166·cs.CV·June 15, 2021·6 cites

Open Access

TL;DR

This paper presents a solution for spatio-temporal person localization in videos based on sentences, achieving second place in the HC-STVG track of the Person in Context Challenge 2021, by combining attribute filtering, advanced tracking, and cross-modal transformers.

Contribution

The approach integrates sentence-derived human attribute filtering, DeepSort-based tracking in which FastReID replaces the original ReID network, and a visual transformer for cross-modal representation, a combination presented as novel for this challenge.

Findings

01 Achieved second place with vIOU of 0.30025
02 Effective filtering of proposals using sentence-derived attributes
03 Utilized a visual transformer for cross-modal localization
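
The vIOU score quoted above is not defined on this page. As a minimal sketch, assuming the definition commonly used for HC-STVG (per-frame box IoU summed over the frames shared by the predicted and ground-truth tubes, divided by the number of frames in the union of the two tubes), it could be computed as follows. The tube representation (a dict from frame index to box) and the function names are illustrative assumptions, not taken from the report.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred_tube, gt_tube):
    """Assumed vIOU: sum of per-frame IoU over shared frames / size of frame union.

    Both tubes are dicts mapping a frame index to a box (hypothetical format).
    """
    frames_union = set(pred_tube) | set(gt_tube)
    frames_inter = set(pred_tube) & set(gt_tube)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred_tube[f], gt_tube[f]) for f in frames_inter)
    return total / len(frames_union)
```

Under this assumed definition a prediction is penalized both for poor boxes and for poor temporal extent: a tube with perfect boxes that overlaps the ground truth on only one of three frames in the union scores 1/3.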

Abstract

In this technical report, we present our solution for localizing a person in space and time in an untrimmed video based on a sentence. We achieve the second-highest vIOU (0.30025) in the HC-STVG track of the 3rd Person in Context (PIC) Challenge. Our solution contains three parts: 1) human attribute information is extracted from the sentence; it helps to filter out tube proposals in the testing phase and supervises our classifier to learn appearance information in the training phase. 2) We detect humans with YOLOv5 and track them with the DeepSort framework, but replace the original ReID network with FastReID. 3) A visual transformer is used to extract cross-modal representations for localizing the spatio-temporal tube of the target person.
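
The test-time attribute filtering in part 1 can be illustrated with a small sketch. The abstract does not say which attributes are used or how they are parsed, so everything here is an assumption made for illustration: the keyword vocabulary, the first-mention rule, the tube format (a dict with a predicted `attributes` field), and the fallback when every proposal is rejected.

```python
import re

# Hypothetical vocabulary; the report does not list the attributes it uses.
ATTRIBUTE_VOCAB = {
    "gender": {"man": "male", "boy": "male", "woman": "female", "girl": "female"},
    "color": {"red": "red", "black": "black", "white": "white", "blue": "blue"},
}


def extract_attributes(sentence):
    """Return the first-mentioned value of each attribute found in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = {}
    for name, vocab in ATTRIBUTE_VOCAB.items():
        for tok in tokens:
            if tok in vocab:
                found[name] = vocab[tok]
                break  # assume the first mention describes the target person
    return found


def filter_tubes(tubes, wanted):
    """Keep tubes whose predicted attributes do not contradict the sentence.

    A tube with no prediction for an attribute is kept. If every tube is
    rejected, all tubes are returned so later stages still have candidates
    (an assumed fallback, not described in the report).
    """
    kept = [
        t for t in tubes
        if all(t["attributes"].get(k, v) == v for k, v in wanted.items())
    ]
    return kept or tubes
```

For example, "The man in the black coat walks to the woman" yields `{"gender": "male", "color": "black"}` under this vocabulary, so a proposal predicted as female would be dropped before the cross-modal stage.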

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods