ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
Kun Li, Dan Guo, Meng Wang

TL;DR
This paper introduces ViGT, a proposal-free video grounding framework that predicts the temporal boundary from a learnable regression token in a transformer, rather than from multi-modal or cross-modal fusion features.
Contribution
It proposes a simple, effective boundary regression paradigm with a learnable token that enhances video grounding without complex feature fusion.
Findings
Achieved competitive results on the ActivityNet Captions (ANet Captions), TACoS, and YouCookII datasets.
Demonstrated the interpretability and effectiveness of the learnable regression token.
Validated the approach through extensive ablation studies.
Abstract
The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Particularly, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context…
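To make the regression-token idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single attention head, a single layer, random untrained weights, and a sigmoid read-out that yields a normalized (start, end) pair; the real ViGT layer count, feature encoders, and boundary decoding are not given in this summary. All names (`RegTokenGrounder`, `attend_from_reg`, `predict`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class RegTokenGrounder:
    """Hypothetical one-layer, one-head sketch of regression-token grounding."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # The token is a free parameter: it is not derived from any video or query.
        self.reg_token = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        self.Wq = rand_matrix(dim, dim, rng)
        self.Wk = rand_matrix(dim, dim, rng)
        self.Wv = rand_matrix(dim, dim, rng)
        self.W_out = rand_matrix(2, dim, rng)  # assumed 2-value boundary head

    def attend_from_reg(self, video_feats, query_feats):
        """Let the token attend over [token; video tokens; query tokens]."""
        tokens = [self.reg_token] + video_feats + query_feats
        q = matvec(self.Wq, self.reg_token)
        keys = [matvec(self.Wk, t) for t in tokens]
        values = [matvec(self.Wv, t) for t in tokens]
        scale = math.sqrt(self.dim)
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(self.dim)
        ]
        return context, weights

    def predict(self, video_feats, query_feats):
        """Return a normalized (start, end) pair and the attention weights."""
        context, weights = self.attend_from_reg(video_feats, query_feats)
        # Residual connection around the attention output.
        reg_out = [r + c for r, c in zip(self.reg_token, context)]
        a, b = (sigmoid(z) for z in matvec(self.W_out, reg_out))
        # Assumption: order the two outputs so that start <= end.
        return (min(a, b), max(a, b)), weights


# Toy inputs: 5 video clip features and 3 query word features, 8 dims each.
rng = random.Random(1)
video = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
query = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
(start, end), weights = RegTokenGrounder(dim=8).predict(video, query)
```

In this sketch the boundary is read only from the token's output, so the video and query features influence the prediction solely through the attention weights. Multiplying `start` and `end` by the video duration would give timestamps in seconds.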
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
