Exploiting Feature Diversity for Make-up Temporal Video Grounding

Xiujun Shu; Wei Wen; Taian Guo; Sunan He; Chen Wu; Ruizhi Qiao

arXiv:2208.06179·cs.CV·August 15, 2022

TL;DR

This technical report describes an approach to temporal video grounding in make-up videos that exploits feature diversity to capture fine-grained semantics, which features from action-based pre-trained models alone do not provide.

Contribution

The report proposes a series of methods, spanning feature extraction, network optimization, and model ensembling, that exploit feature diversity for finer-grained video-text alignment in make-up step localization.

Findings

01 Achieved 3rd place in the MTVG competition

02 Enhanced fine-grained video feature extraction

03 Improved temporal localization accuracy

Abstract

This technical report presents the 3rd-place solution for Make-up Temporal Video Grounding (MTVG), a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims to localize the temporal boundary of a make-up step in an untrimmed video based on a textual description. The biggest challenge of this task is the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are not sufficient to provide fine-grained cues. To address this issue, we propose to achieve a fine-grained representation by exploiting feature diversity. Specifically, we propose a series of methods spanning feature extraction, network optimization, and model ensembling. As a result, we achieved 3rd place in the MTVG competition.
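
To make the task output concrete: a grounding model returns a (start, end) segment for a step description, and a predicted segment is commonly compared with the annotated one using temporal intersection-over-union (IoU). The sketch below is illustrative only. The abstract does not state the challenge's evaluation metric, and the function name `temporal_iou` and the example timestamps are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, given in seconds.

    Illustrative helper, not taken from the report.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Length of the overlap; zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))

    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical example: predicted step at 10-20 s, annotated step at 15-25 s.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
print(f"temporal IoU = {score:.3f}")
```

A prediction is typically counted as correct when this score exceeds a chosen threshold, so small boundary errors matter more for short, fine-grained steps than for long actions.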

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition