Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, Shuicheng Yan

TL;DR
This paper introduces Finsta, a novel fine-grained structural spatio-temporal alignment method that enhances video-language models by representing data with scene graphs and employing graph Transformers for improved cross-modal understanding.
Contribution
Finsta is a plug-and-play framework that unifies scene graph representations for texts and videos, improving alignment and performance without retraining from scratch or needing additional annotations.
Findings
Consistently improves 13 strong VLMs across 12 datasets.
Significantly enhances performance in both fine-tuning and zero-shot settings.
Achieves state-of-the-art results on multiple video-language tasks.
Abstract
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A…
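To make the three graph types in the abstract concrete, here is a minimal illustrative sketch of a textual SG, a per-frame dynamic SG, and their union into a holistic SG. It is not the authors' implementation. The abstract does not say how nodes are linked across frames or across modalities, so the exact-label matching used here for "temporal" and "cross_modal" edges is an assumption, and every name (`Node`, `SceneGraph`, `build_text_sg`, `build_dynamic_sg`, `build_holistic_sg`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    uid: str          # unique id, e.g. "t:dog" (text) or "v0:dog" (video frame 0)
    label: str        # object label
    modality: str     # "text" or "video"
    frame: int = -1   # frame index for video nodes, -1 for text nodes


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # uid -> Node
    edges: set = field(default_factory=set)    # (src_uid, edge_type, dst_uid)

    def add_node(self, node):
        self.nodes[node.uid] = node

    def add_edge(self, src, edge_type, dst):
        self.edges.add((src, edge_type, dst))


def build_text_sg(triples):
    """Textual SG from (subject, relation, object) triples."""
    sg = SceneGraph()
    for subj, rel, obj in triples:
        for label in (subj, obj):
            sg.add_node(Node(f"t:{label}", label, "text"))
        sg.add_edge(f"t:{subj}", rel, f"t:{obj}")
    return sg


def build_dynamic_sg(frames):
    """Dynamic SG: one triple list per frame, plus temporal edges.

    Assumption: the same label in consecutive frames is the same object.
    """
    sg = SceneGraph()
    for i, triples in enumerate(frames):
        for subj, rel, obj in triples:
            for label in (subj, obj):
                sg.add_node(Node(f"v{i}:{label}", label, "video", i))
            sg.add_edge(f"v{i}:{subj}", rel, f"v{i}:{obj}")
    for node in list(sg.nodes.values()):
        nxt = f"v{node.frame + 1}:{node.label}"
        if nxt in sg.nodes:
            sg.add_edge(node.uid, "temporal", nxt)
    return sg


def build_holistic_sg(tsg, dsg):
    """Holistic SG: union of both graphs plus cross-modal edges.

    Assumption: text and video nodes are linked when their labels match.
    """
    hsg = SceneGraph()
    hsg.nodes = {**tsg.nodes, **dsg.nodes}
    hsg.edges = set(tsg.edges) | set(dsg.edges)
    for t in tsg.nodes.values():
        for v in dsg.nodes.values():
            if t.label == v.label:
                hsg.add_edge(t.uid, "cross_modal", v.uid)
    return hsg


tsg = build_text_sg([("dog", "chases", "ball")])
dsg = build_dynamic_sg([
    [("dog", "near", "ball")],   # frame 0
    [("dog", "holds", "ball")],  # frame 1
])
hsg = build_holistic_sg(tsg, dsg)
```

In this toy setup the holistic graph keeps every intra-modal edge and adds one cross-modal edge per matching text/video node pair. A learned model would replace the label matching with some similarity-based linking and would encode the graphs with graph Transformers rather than store them as plain sets.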
Taxonomy
Methods
Attention Is All You Need · Softmax · Layer Normalization · Laplacian EigenMap · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
