Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

Jingyi Wang, Jinfa Huang, Can Zhang, and Zhidong Deng

arXiv:2305.08522 · cs.CV · May 16, 2023 · 1 citation

TL;DR

This paper introduces TR$^2$, a transformer-based model that effectively captures the changing relations in dynamic scene graphs over time, improving semantic understanding in video analysis tasks.

Contribution

The paper proposes a novel time-variant relation modeling approach using a transformer and cross-modality supervision, advancing dynamic scene graph generation.

Findings

1. TR$^2$ outperforms previous state-of-the-art methods on the Action Genome dataset by 2.1% and 2.6% under two different evaluation settings, respectively.

2. Explicit relation supervision via text embeddings enhances relation learning.

3. The method models how relations change over time across video frames.

Abstract

Dynamic scene graphs generated from video clips could help enhance the semantic visual understanding in a wide range of challenging tasks such as environmental perception, autonomous navigation, and task planning of self-driving vehicles and mobile robots. In the process of temporal and spatial modeling during dynamic scene graph generation, it is particularly intractable to learn time-variant relations in dynamic scene graphs among frames. In this paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which aims to model the temporal change of relations in dynamic scene graphs. Explicitly, we leverage the difference of text embeddings of prompted sentences about relation labels as the supervision signal for relations. In this way, cross-modality feature guidance is realized for the learning of time-variant relations. Implicitly, we design a relation feature fusion module…
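The explicit supervision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_text_embedding` is a hypothetical stand-in for a real text encoder, the prompt template and relation labels are illustrative, and the mean-squared-error loss is an assumed choice, since the abstract does not state how the difference signal is applied. The sketch only shows the general idea of using the difference between text embeddings of two prompted relation sentences as a target for the change in visual relation features between frames.

```python
import hashlib
import math
import random


def toy_text_embedding(sentence, dim=16):
    """Hypothetical stand-in for a text encoder: one fixed unit vector per sentence."""
    seed = int.from_bytes(hashlib.sha256(sentence.encode("utf-8")).digest()[:8], "little")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def prompt(subject, relation, obj):
    """Illustrative prompt template (an assumption, not taken from the paper)."""
    return f"a photo of a {subject} {relation} a {obj}"


def text_difference_target(rel_prev, rel_next, subject="person", obj="cup"):
    """Difference of text embeddings for the relation labels of two frames."""
    emb_prev = toy_text_embedding(prompt(subject, rel_prev, obj))
    emb_next = toy_text_embedding(prompt(subject, rel_next, obj))
    return [n - p for n, p in zip(emb_next, emb_prev)]


def difference_loss(visual_prev, visual_next, target):
    """Assumed loss: MSE between the visual feature change and the text target."""
    errors = [(vn - vp) - t for vp, vn, t in zip(visual_prev, visual_next, target)]
    return sum(e * e for e in errors) / len(target)


# Usage: the relation changes from "holding" to "drinking from" between frames.
target = text_difference_target("holding", "drinking from")
visual_prev = [0.0] * len(target)
loss_unchanged = difference_loss(visual_prev, visual_prev, target)
loss_matched = difference_loss(visual_prev, list(target), target)
```

In this toy setup an unchanged relation gives an all-zero target, so the supervision signal is nonzero only when the relation label changes between frames.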

Code & Models

Repositories

qncsn2016/TR2 (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling