TubeDETR: Spatio-Temporal Video Grounding with Transformers

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

arXiv:2203.16434 · cs.CV · June 10, 2022

TL;DR

TubeDETR is a transformer-based model that localizes the spatio-temporal tube in a video corresponding to a text query. It jointly models spatial, temporal, and multi-modal interactions and improves over the state of the art on the VidSTG and HC-STVG benchmarks.

Contribution

The paper presents a novel transformer architecture for spatio-temporal video grounding that jointly models spatial, temporal, and multi-modal interactions, using an efficient video and text encoder over sparsely sampled frames and a space-time decoder.

Findings

1. Outperforms the state of the art on the VidSTG and HC-STVG benchmarks.
2. Demonstrates the effectiveness of the space-time decoder and multi-modal encoder components.
3. Provides extensive ablation studies validating the design choices.

Abstract

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks.…
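
To make the two notions in the abstract concrete, here is a minimal Python sketch of what a spatio-temporal tube and sparse frame sampling could look like as data. Everything in it is an assumption for illustration: the `Tube` class, the `sample_sparse` and `boxes_at` helpers, the `(x1, y1, x2, y2)` box convention, and uniform striding are hypothetical and are not taken from the TubeDETR code or paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical container: a temporal segment plus one box per frame in it."""
    start: int             # first frame index of the segment (inclusive)
    end: int               # last frame index of the segment (inclusive)
    boxes: Dict[int, Box]  # frame index -> bounding box

    def frames(self) -> List[int]:
        """All frame indices covered by the tube."""
        return list(range(self.start, self.end + 1))


def sample_sparse(num_frames: int, stride: int) -> List[int]:
    """Keep every `stride`-th frame index (uniform sampling is an assumption)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def boxes_at(tube: Tube, frame_ids: List[int]) -> Dict[int, Box]:
    """Restrict a tube to the given frames; frames outside the tube are skipped."""
    return {t: tube.boxes[t] for t in frame_ids if t in tube.boxes}


if __name__ == "__main__":
    # A toy tube covering frames 2..5 with a constant box.
    tube = Tube(start=2, end=5, boxes={t: (10.0, 20.0, 50.0, 80.0) for t in range(2, 6)})
    sampled = sample_sparse(num_frames=10, stride=2)
    print(tube.frames())
    print(boxes_at(tube, sampled))
```

In this toy setup, a tube is fully described by its start frame, its end frame, and the boxes in between, which is why the task needs both temporal and spatial localization.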

Code & Models

Repositories

antoyang/TubeDETR (PyTorch, official)

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax