Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

Tao Pu; Tianshui Chen; Hefeng Wu; Yongyi Lu; Liang Lin

arXiv:2309.13237·cs.CV·December 18, 2023

TL;DR

This paper introduces STKET, a transformer that embeds prior spatial-temporal knowledge into its multi-head cross-attention mechanism for video scene graph generation (VidSGG), improving relationship prediction in videos by up to 8.1% in mR@50.

Contribution

The work proposes a novel spatial-temporal knowledge-embedded transformer that incorporates prior correlations into the attention mechanism for better VidSGG performance.

Findings

01

Outperforms existing algorithms with up to 8.1% improvement in mR@50.

02

Effectively models spatial co-occurrence and temporal transition correlations.

03

Achieves significant accuracy gains across different experimental settings.
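For readers unfamiliar with the mR@50 figure quoted above, the following is a minimal sketch of mean recall at K for a single frame. It is a simplified illustration under stated assumptions, not the evaluation code used in the paper: the function name `mean_recall_at_k`, the triplet layout, and the toy data are all hypothetical, and benchmark implementations typically aggregate per-predicate recall over a whole dataset rather than one frame.

```python
from collections import defaultdict


def mean_recall_at_k(ground_truth, predictions, k):
    """Simplified single-frame mean recall at K (hypothetical helper).

    ground_truth: set of (subject, relation, object) triplets.
    predictions:  list of (score, (subject, relation, object)) pairs.
    """
    # Keep the K highest-scoring predicted triplets.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    top_k = {triplet for _, triplet in ranked}

    # Count ground-truth triplets and hits separately for each relation class.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        relation = triplet[1]
        totals[relation] += 1
        if triplet in top_k:
            hits[relation] += 1

    # Average the per-relation recalls so rare relations weigh as much as common ones.
    recalls = [hits[r] / totals[r] for r in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy data (made up for illustration).
gt = {
    ("person", "holding", "cup"),
    ("person", "holding", "phone"),
    ("person", "sitting_on", "sofa"),
}
preds = [
    (0.9, ("person", "holding", "cup")),
    (0.8, ("person", "sitting_on", "sofa")),
    (0.7, ("person", "looking_at", "cup")),
    (0.3, ("person", "holding", "phone")),
]
print(mean_recall_at_k(gt, preds, k=2))  # 0.75
```

With K = 2, "holding" is recalled for one of its two triplets and "sitting_on" for its only triplet, so the mean over relation classes is (0.5 + 1.0) / 2 = 0.75. Averaging per relation class is what distinguishes mR@K from plain R@K, which would pool all triplets together.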

Abstract

Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a…
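To make the two kinds of prior knowledge named in the abstract concrete, here is a minimal sketch that estimates them by simple counting over annotated frames. This is an assumption made for illustration: the abstract is truncated before it says how the correlations are learned, so the count-based estimates, the function names `cooccurrence_prior` and `transition_prior`, and the toy annotations below are hypothetical and should not be read as the authors' method.

```python
from collections import Counter, defaultdict


def cooccurrence_prior(frames):
    """Estimate P(relation | subject class, object class) within frames.

    frames: list of frames; each frame is a list of
            (subject, object, relation) tuples. (Hypothetical layout.)
    """
    pair_counts = Counter()
    triple_counts = Counter()
    for frame in frames:
        for subj, obj, rel in frame:
            pair_counts[(subj, obj)] += 1
            triple_counts[(subj, obj, rel)] += 1

    prior = defaultdict(dict)
    for (subj, obj, rel), n in triple_counts.items():
        prior[(subj, obj)][rel] = n / pair_counts[(subj, obj)]
    return dict(prior)


def transition_prior(frames):
    """Estimate P(relation at t+1 | relation at t, subject, object)
    from consecutive frame pairs."""
    prev_counts = Counter()
    trans_counts = Counter()
    for cur, nxt in zip(frames, frames[1:]):
        # Index the next frame's relations by (subject, object) pair.
        next_by_pair = defaultdict(set)
        for subj, obj, rel in nxt:
            next_by_pair[(subj, obj)].add(rel)
        for subj, obj, rel in cur:
            for rel_next in next_by_pair.get((subj, obj), ()):
                prev_counts[(subj, obj, rel)] += 1
                trans_counts[(subj, obj, rel, rel_next)] += 1

    prior = defaultdict(dict)
    for (subj, obj, rel, rel_next), n in trans_counts.items():
        prior[(subj, obj, rel)][rel_next] = n / prev_counts[(subj, obj, rel)]
    return dict(prior)


# Toy annotations (made up for illustration), one list per frame.
frames = [
    [("person", "cup", "holding")],
    [("person", "cup", "holding")],
    [("person", "cup", "drinking_from")],
    [("person", "cup", "holding"), ("person", "sofa", "sitting_on")],
]
spatial = cooccurrence_prior(frames)
temporal = transition_prior(frames)
print(spatial[("person", "cup")])              # {'holding': 0.75, 'drinking_from': 0.25}
print(temporal[("person", "cup", "holding")])  # {'holding': 0.5, 'drinking_from': 0.5}
```

Tables like `spatial` and `temporal` are one plausible form such priors could take before being fed into an attention mechanism; how STKET actually represents and injects them is described in the full paper, not in this excerpt.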

Code & Models

Repositories

hcplab-sysu/stket (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization