Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding
Jiaxiu Li, Kun Li, Jia Li, Guoliang Chen, Dan Guo, Meng Wang

TL;DR
This paper introduces DPTMO, a dual-path network that improves fine-grained make-up video grounding by capturing detailed semantic cues through query-agnostic and query-guided features, leading to more accurate localization.
Contribution
The paper proposes a dual-path, proposal-based framework that captures fine-grained semantic details in make-up videos and surpasses previous methods on the YouMakeup dataset.
Findings
DPTMO outperforms previous methods on the YouMakeup dataset.
Dual-path structure enhances semantic comprehension of make-up activities.
Joint optimization of two proposal sets improves timestamp prediction accuracy.
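As a rough illustration of the last finding, the sketch below shows one way two proposal score maps could be combined to pick a timestamp. It is not the authors' method. The 2D map layout (entry [i][j] scores the segment running from clip i through clip j), the simple averaging rule, and the names temporal_iou, fuse_and_pick, query_agnostic, and query_guided are all assumptions made for this example, and the scores are made-up toy values.

```python
from typing import List, Tuple

ScoreMap = List[List[float]]  # assumed layout: entry [i][j] scores clips i..j


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def fuse_and_pick(map_a: ScoreMap, map_b: ScoreMap,
                  clip_len: float) -> Tuple[Tuple[float, float], float]:
    """Average two score maps and return the best segment and its score.

    Averaging is an illustrative fusion rule only; the paper's actual
    joint optimization is not described in this summary.
    """
    n = len(map_a)
    best_score = float("-inf")
    best = (0, 0)
    for i in range(n):
        for j in range(i, n):  # only valid proposals: start <= end
            score = 0.5 * (map_a[i][j] + map_b[i][j])
            if score > best_score:
                best_score, best = score, (i, j)
    start, end = best
    return (start * clip_len, (end + 1) * clip_len), best_score


# Toy score maps for a video split into 4 clips of 5 seconds each.
query_agnostic = [
    [0.1, 0.2, 0.3, 0.1],
    [0.0, 0.4, 0.8, 0.3],
    [0.0, 0.0, 0.5, 0.2],
    [0.0, 0.0, 0.0, 0.1],
]
query_guided = [
    [0.2, 0.1, 0.2, 0.1],
    [0.0, 0.3, 0.6, 0.5],
    [0.0, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.2],
]

segment, score = fuse_and_pick(query_agnostic, query_guided, clip_len=5.0)
overlap = temporal_iou(segment, (6.0, 16.0))  # hypothetical ground truth
print(segment, round(score, 3), round(overlap, 3))
```

Temporal IoU against the annotated segment is the usual basis for grounding metrics, which is why it is included here alongside the selection step.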
Abstract
Make-up temporal video grounding (MTVG) aims to localize, within a long video, the target segment that is semantically related to a sentence describing a make-up activity. Compared with the general video grounding task, MTVG focuses on meticulous actions and changes on the face. The make-up instruction step, which usually involves detailed differences in products and facial areas, is more fine-grained than general activities (e.g., cooking and furniture assembly). Thus, existing general approaches cannot locate the target activity effectively. More specifically, existing proposal generation modules are not yet fully developed in providing semantic cues for the more fine-grained make-up semantic comprehension. To tackle this issue, we propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Human Pose and Action Recognition
