ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

Kun Li; Dan Guo; Meng Wang

arXiv:2308.06009·cs.CV·August 14, 2023

TL;DR

This paper introduces ViGT, a proposal-free video grounding framework that regresses temporal boundaries from a learnable token in a transformer rather than from fused multi-modal or cross-modal features.

Contribution

It proposes a simple, effective boundary regression paradigm with a learnable token that enhances video grounding without complex feature fusion.

Findings

01. Achieved competitive results on the ANet Captions, TACoS, and YouCookII datasets.

02. Demonstrated the interpretability and effectiveness of the learnable regression token.

03. Validated the approach through extensive ablation studies.

Abstract

The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Particularly, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context…
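To make the idea of boundary regression from a learnable token more concrete, here is a minimal, dependency-free sketch. It is not the authors' architecture: the single attention head, the feature width `DIM`, the random untrained weights, the sigmoid output head, and every function name are assumptions made for illustration. It only shows the general pattern in which a token that is independent of the input attends over video and query features and is then mapped to a normalized (start, end) pair.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical feature width, chosen only for this demo


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rows, cols):
    return [[random.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(mat, vec):
    return [dot(row, vec) for row in mat]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_attention(token, feats, w_q, w_k, w_v):
    """Single-head attention where the token is the only query."""
    q = matvec(w_q, token)
    keys = [matvec(w_k, f) for f in feats]
    values = [matvec(w_v, f) for f in feats]
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    updated = [t + p for t, p in zip(token, pooled)]  # residual connection
    return updated, weights


def regress_boundary(token, w_out, b_out):
    """Map the token to two values in (0, 1), ordered as (start, end)."""
    raw = [z + b for z, b in zip(matvec(w_out, token), b_out)]
    a, b = (1.0 / (1.0 + math.exp(-z)) for z in raw)
    return min(a, b), max(a, b)  # sorting is a demo shortcut, not from the paper


# Stand-in inputs: 6 clip features and 4 word features (random, not real data).
video_feats = [rand_vec(DIM) for _ in range(6)]
query_feats = [rand_vec(DIM) for _ in range(4)]

# The regression token is a free parameter, not derived from video or query.
reg_token = rand_vec(DIM)
w_q, w_k, w_v = (rand_mat(DIM, DIM) for _ in range(3))
w_out, b_out = rand_mat(2, DIM), [0.0, 0.0]

reg_token, weights = token_attention(
    reg_token, video_feats + query_feats, w_q, w_k, w_v
)
start, end = regress_boundary(reg_token, w_out, b_out)
print(f"normalized boundary: ({start:.3f}, {end:.3f})")
```

Because the weights are random, the printed boundary carries no meaning. In a trained model the token and projection matrices would be learned, and the attention weights over clips and words are the kind of quantity one could inspect when examining what the token attends to.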

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection