TL;DR
TubeDETR is a transformer-based model that localizes the spatio-temporal tube in a video corresponding to a text query. It jointly models temporal, spatial, and multi-modal interactions and improves over previous methods on the VidSTG and HC-STVG benchmarks.
Contribution
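The paper presents a transformer architecture for spatio-temporal video grounding with two main components: an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames, and a space-time decoder that performs spatio-temporal localization jointly.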
Findings
Improves over the state of the art on the VidSTG and HC-STVG benchmarks
Shows the benefit of the space-time decoder and the video and text encoder
Supports these design choices with an extensive ablation study
Abstract
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks.…
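To make the terms in the abstract concrete, the sketch below shows one way to represent a spatio-temporal tube (one bounding box per frame over a contiguous time span) and to sample frames sparsely. It is an illustration only, not TubeDETR's code: the `Tube` class, the `(x1, y1, x2, y2)` box convention, the fixed-stride sampling in `sample_frames`, and the `restrict_to` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame index."""
    boxes: Dict[int, Box]

    @property
    def start(self) -> int:
        # First frame index covered by the tube.
        return min(self.boxes)

    @property
    def end(self) -> int:
        # Last frame index covered by the tube.
        return max(self.boxes)


def sample_frames(num_frames: int, stride: int) -> List[int]:
    """Sparse sampling (assumed fixed stride): keep every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def restrict_to(tube: Tube, frames: List[int]) -> Tube:
    """Keep only the tube's boxes that fall on the sampled frames."""
    keep = set(frames)
    return Tube(boxes={t: b for t, b in tube.boxes.items() if t in keep})


if __name__ == "__main__":
    frames = sample_frames(num_frames=10, stride=4)
    tube = Tube(boxes={t: (0.0, 0.0, 10.0, 10.0) for t in range(3, 9)})
    sparse = restrict_to(tube, frames)
    print(frames, tube.start, tube.end, sorted(sparse.boxes))
```

In this toy setup, a tube spanning frames 3 to 8 is seen by a frame-sparse encoder only at the sampled frames that fall inside it, so a decoder still has to produce boxes and temporal boundaries for the whole span.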
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
