TL;DR
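This paper introduces a tracklet-based visual Transformer for video visual relation detection (VidVRD). It builds on tracklet proposals generated with MEGA and deepSORT and uses a temporal-aware decoder, outperforming other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021.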
Contribution
It proposes a tracklet-based approach to VidVRD that needs no pre-cutting operations: a visual Transformer whose temporal-aware decoder lets tracklet features interact with learnable predicate query embeddings to predict relations.
Findings
Outperforms other methods by a large margin on the VRU Grand Challenge at ACM Multimedia 2021
Demonstrates the effectiveness of tracklet-based visual Transformers
Validates the superiority of the proposed approach through extensive experiments
Abstract
Video Visual Relation Detection (VidVRD) has received significant attention from our community in recent years. In this paper, we apply the state-of-the-art video object tracklet detection pipeline, MEGA and deepSORT, to generate tracklet proposals. Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder that performs feature interactions between the tracklets and learnable predicate query embeddings and finally predicts the relations. Experimental results strongly demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.
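The feature interaction between tracklets and predicate query embeddings described in the abstract can be pictured as cross-attention. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain single-head scaled dot-product attention in pure Python, the names `cross_attention`, `tracklet_feats`, and `predicate_queries` are hypothetical, the sizes are arbitrary, the queries are random rather than learned, and the temporal-aware part of the decoder is omitted entirely.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: one vector per predicate query (hypothetical embeddings)
    keys, values: one vector per tracklet (hypothetical features)
    Returns (outputs, weights): an attended vector and a weight row per query.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every tracklet feature.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted sum of tracklet values.
        out = [sum(w_t * v[d] for w_t, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Assumed toy sizes; the paper's actual dimensions are not stated in the abstract.
random.seed(0)
DIM, NUM_TRACKLETS, NUM_QUERIES = 8, 5, 3
tracklet_feats = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_TRACKLETS)]
# Stand-in for learnable embeddings: randomly initialised, never trained here.
predicate_queries = [[random.gauss(0, 1) for _ in range(DIM)]
                     for _ in range(NUM_QUERIES)]

attended, attn = cross_attention(predicate_queries, tracklet_feats, tracklet_feats)
```

Each row of `attn` indicates how strongly one predicate query attends to each tracklet. In a full model, a classification head on `attended` would predict the relation; that step is left out of this sketch.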
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing · Softmax · Byte Pair Encoding
