Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs
Jingyi Wang, Jinfa Huang, Can Zhang, and Zhidong Deng

TL;DR
This paper introduces TR$^2$, a transformer-based model that effectively captures the changing relations in dynamic scene graphs over time, improving semantic understanding in video analysis tasks.
Contribution
The paper proposes a novel time-variant relation modeling approach using a transformer and cross-modality supervision, advancing dynamic scene graph generation.
Findings
TR$^2$ outperforms previous methods by 2.1% and 2.6% under two different evaluation settings on the Action Genome dataset.
Explicit relation supervision via text embeddings enhances relation learning.
The method effectively models temporal relation changes in videos.
Abstract
Dynamic scene graphs generated from video clips could help enhance the semantic visual understanding in a wide range of challenging tasks such as environmental perception, autonomous navigation, and task planning of self-driving vehicles and mobile robots. In the process of temporal and spatial modeling during dynamic scene graph generation, it is particularly intractable to learn time-variant relations in dynamic scene graphs among frames. In this paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which aims to model the temporal change of relations in dynamic scene graphs. Explicitly, we leverage the difference of text embeddings of prompted sentences about relation labels as the supervision signal for relations. In this way, cross-modality feature guidance is realized for the learning of time-variant relations. Implicitly, we design a relation feature fusion module…
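The explicit supervision described in the abstract (using the difference of text embeddings of prompted relation sentences as a target for how a relation changes between frames) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the prompt template, the hash-based `toy_text_embedding` (a stand-in for a real pretrained text encoder), and the cosine-based `cosine_alignment_loss` are hypothetical choices, not the authors' implementation.

```python
import hashlib

import numpy as np

DIM = 16  # toy embedding size (assumption)


def toy_text_embedding(sentence: str, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text encoder.

    Maps a sentence to a deterministic unit vector seeded by its hash,
    so identical sentences always get identical embeddings.
    """
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def prompt(subject: str, relation: str, obj: str) -> str:
    # Assumed prompt template; the source does not specify the wording.
    return f"a photo of a {subject} {relation} a {obj}"


def relation_difference_target(subject, obj, rel_prev, rel_next) -> np.ndarray:
    """Text-side supervision signal: embedding(next) - embedding(previous)."""
    e_prev = toy_text_embedding(prompt(subject, rel_prev, obj))
    e_next = toy_text_embedding(prompt(subject, rel_next, obj))
    return e_next - e_prev


def cosine_alignment_loss(visual_diff, text_diff, eps: float = 1e-8) -> float:
    """Assumed loss: align the change in visual relation features
    with the change in text embeddings."""
    n_v = np.linalg.norm(visual_diff)
    n_t = np.linalg.norm(text_diff)
    if n_t < eps:
        # Relation label unchanged: penalise any change in visual features.
        return float(n_v ** 2)
    cos = float(visual_diff @ text_diff) / (n_v * n_t + eps)
    return 1.0 - cos


# Usage: the relation changes from "holding" to "drinking from".
target = relation_difference_target("person", "cup", "holding", "drinking from")
unchanged = relation_difference_target("person", "cup", "holding", "holding")
print(cosine_alignment_loss(target, target))    # aligned change: near 0
print(cosine_alignment_loss(-target, target))   # opposite change: near 2
```

In a real pipeline the visual difference would come from relation features of the same subject-object pair in adjacent frames, and the text encoder would be a pretrained language or vision-language model. The sketch only shows how a text-embedding difference can act as a target for the temporal change of a relation.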
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
