Video-Language Alignment via Spatio-Temporal Graph Transformer

Shi-Xue Zhang; Hongfa Wang; Xiaobin Zhu; Weibo Gu; Tianjin Zhang; Chun Yang; Wei Liu; Xu-Cheng Yin

arXiv:2407.11677·cs.CV·July 25, 2024

TL;DR

This paper introduces a Spatio-Temporal Graph Transformer (STGT) that models spatial and temporal relationships in videos to improve video-language alignment, enhancing performance in retrieval and question answering tasks.

Contribution

The paper proposes a novel STGT module that integrates spatio-temporal graph structures with transformer attention for better video-language alignment.

Findings

01 Superior performance on video-text retrieval tasks

02 Effective modeling of spatio-temporal relationships

03 Improved accuracy in video question answering

Abstract

Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with attention in transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting…
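The idea of combining graph structure with transformer attention can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that vision tokens sit on a T x H x W grid, that edges link 4-neighbours within a frame and the same position in adjacent frames, and that the graph is applied as a hard attention mask. The paper may define the graph and inject it into attention differently. The names `grid_adjacency` and `graph_masked_attention` are hypothetical.

```python
import numpy as np


def grid_adjacency(T, H, W):
    """Boolean adjacency for a T x H x W token grid (assumed layout).

    Spatial edges link 4-neighbours inside a frame; temporal edges link the
    same position in adjacent frames. Self-loops are included so every token
    can attend to itself.
    """
    n = T * H * W
    idx = np.arange(n).reshape(T, H, W)
    adj = np.eye(n, dtype=bool)
    neighbour_pairs = [
        (idx[:, :-1, :], idx[:, 1:, :]),  # vertical neighbours in a frame
        (idx[:, :, :-1], idx[:, :, 1:]),  # horizontal neighbours in a frame
        (idx[:-1], idx[1:]),              # same position, adjacent frames
    ]
    for a, b in neighbour_pairs:
        adj[a.ravel(), b.ravel()] = True
        adj[b.ravel(), a.ravel()] = True
    return adj


def graph_masked_attention(x, adj, wq, wk, wv):
    """Single-head scaled dot-product attention restricted to graph edges."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumption: non-edges are masked out entirely (hard mask, not a bias).
    scores = np.where(adj, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy example: 2 frames of 2 x 2 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
T, H, W, d = 2, 2, 2, 8
adj = grid_adjacency(T, H, W)
x = rng.standard_normal((T * H * W, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = graph_masked_attention(x, adj, wq, wk, wv)
```

A hard mask is the simplest way to make attention respect a graph. A softer alternative would add a learned bias per edge type to the scores instead of setting non-edges to negative infinity.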

Code & Models

Repositories

gxym/stgt (PyTorch, official)

Taxonomy

Methods

Attention Is All You Need · Laplacian EigenMap · Laplacian Positional Encodings · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout