Text-Visual Prompting for Efficient 2D Temporal Video Grounding

Yimeng Zhang; Xin Chen; Jinghan Jia; Sijia Liu; Ke Ding

arXiv:2303.04995·cs.CV·October 5, 2023·1 cites

Open Access · 1 Repo · 1 Model

TL;DR

This paper introduces a text-visual prompting framework for efficient 2D temporal video grounding, significantly reducing computational complexity while boosting performance on benchmark datasets.

Contribution

The authors propose a novel prompting approach that makes temporal video grounding effective with a 2D vision encoder in place of 3D CNN feature extraction, and they introduce a new loss function for improved learning.
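As a rough illustration of the prompting idea, the abstract describes prompts as optimized perturbation patterns applied to the visual inputs and to the textual features. The sketch below is not the authors' implementation. It assumes, purely for illustration, that the visual prompt is added pixel-wise to each sampled frame and that the text prompt is a set of vectors prepended to the token features; the function names `apply_visual_prompt` and `apply_text_prompt` are hypothetical.

```python
def apply_visual_prompt(frames, prompt):
    """Add one shared perturbation pattern to every frame.

    frames: list of frames, each a 2D list of pixel values in [0, 1].
    prompt: 2D list with the same shape as a frame (assumed additive).
    """
    return [
        [
            # Clamp so perturbed pixels stay in the valid [0, 1] range.
            [min(1.0, max(0.0, px + delta)) for px, delta in zip(row, prompt_row)]
            for row, prompt_row in zip(frame, prompt)
        ]
        for frame in frames
    ]


def apply_text_prompt(token_features, prompt_vectors):
    """Prepend prompt vectors to the token feature sequence (an assumption)."""
    return list(prompt_vectors) + list(token_features)


frames = [[[0.5, 0.9]], [[0.0, 0.2]]]      # two tiny 1x2 "frames"
visual_prompt = [[0.1, 0.2]]               # same shape as one frame
prompted_frames = apply_visual_prompt(frames, visual_prompt)

token_features = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, 2-dim features
text_prompt = [[0.0, 0.0]]                 # one prompt vector
prompted_text = apply_text_prompt(token_features, text_prompt)
```

In a trainable model the prompt values would be parameters optimized together with the encoders; here they are fixed numbers so the data flow is easy to follow.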

Findings

1. Achieves up to 30.77% performance improvement on ActivityNet Captions.

2. Provides 5x faster inference compared to 3D CNN-based methods.

3. Demonstrates effectiveness on the Charades-STA and ActivityNet Captions datasets.

Abstract

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and language encoder in a 2D TVG model and improves the performance of crossmodal…
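Since the task is to predict start and end time points, predictions on TVG benchmarks are commonly scored by temporal intersection over union (IoU) against the annotated moment. The sketch below shows that generic metric, not the paper's new loss function; the helper names `temporal_iou` and `recall_at_iou` and the 0.5 threshold are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, both in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is zero when the intervals do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


preds = [(0.0, 10.0), (0.0, 5.0)]   # hypothetical predicted moments
gts = [(0.0, 10.0), (5.0, 10.0)]    # hypothetical annotated moments
score = recall_at_iou(preds, gts, threshold=0.5)
```

With these toy intervals the first prediction matches its annotation exactly and the second does not overlap at all, so half of the predictions count as hits.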

Code & Models

Repositories

intel/TVP (PyTorch, official)

Models

🤗 Intel/tvp-base · model · 370 downloads · ♡ 1

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization