TL;DR
This paper introduces a Spatio-Temporal Graph Transformer (STGT) that models spatial and temporal relationships in videos to improve video-language alignment, enhancing performance in retrieval and question answering tasks.
Contribution
The paper proposes a novel STGT module that integrates spatio-temporal graph structures with transformer attention for better video-language alignment.
Findings
Superior performance on video-text retrieval tasks
Effective modeling of spatio-temporal relationships
Improved accuracy in video question answering
Abstract
Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to improve alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within a video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with the attention in the transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting…
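The abstract says the module combines graph structure with transformer attention but, as truncated here, does not say how. The sketch below illustrates one common way to do this, an additive bias on the attention scores, and should not be read as the paper's actual formulation. The function names `grid_adjacency` and `graph_biased_attention`, the `bias_scale` parameter, and the choice of graph (4-neighbour patches within a frame plus the same patch in adjacent frames) are all assumptions made for illustration.

```python
import math

def grid_adjacency(T, H, W):
    """Hypothetical spatio-temporal graph over T*H*W vision tokens.

    Assumption: edges link 4-neighbour patches within a frame and the
    same patch position in consecutive frames.
    """
    n = T * H * W
    A = [[0.0] * n for _ in range(n)]

    def idx(t, h, w):
        return (t * H + h) * W + w

    for t in range(T):
        for h in range(H):
            for w in range(W):
                i = idx(t, h, w)
                # Forward neighbours only; symmetry fills in the rest.
                for dt, dh, dw in ((0, 0, 1), (0, 1, 0), (1, 0, 0)):
                    t2, h2, w2 = t + dt, h + dh, w + dw
                    if t2 < T and h2 < H and w2 < W:
                        j = idx(t2, h2, w2)
                        A[i][j] = A[j][i] = 1.0
    return A

def graph_biased_attention(Q, K, V, A, bias_scale=1.0):
    """Scaled dot-product attention with an additive graph bias.

    Assumption: graph structure enters as bias_scale * A added to the
    attention logits before the softmax.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias_scale * A[i][j]
            for j, k in enumerate(K)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        row = [e / z for e in exps]
        weights.append(row)
        outputs.append(
            [sum(row[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights
```

With identical queries and keys, the bias alone decides the weights, so each token attends more strongly to its graph neighbours than to unconnected tokens. A learned bias or a hard mask would be alternative design choices under the same idea.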
Taxonomy
Methods: Attention Is All You Need · Laplacian EigenMap · Laplacian Positional Encodings · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout
