Video Relation Detection via Tracklet based Visual Transformer

Kaifeng Gao; Long Chen; Yifeng Huang; Jun Xiao

arXiv:2108.08669·cs.CV·August 20, 2021

TL;DR

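This paper introduces a tracklet-based visual Transformer for video visual relation detection (VidVRD). It takes tracklet proposals generated with MEGA and deepSORT and uses a temporal-aware decoder to predict relations, outperforming other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021.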

Contribution

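The paper proposes a tracklet-based approach to VidVRD that requires no pre-cutting of the video: a visual Transformer whose temporal-aware decoder lets tracklet features interact with learnable predicate query embeddings before predicting relations.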

Findings

01. Outperforms other methods by a large margin on the VRU Grand Challenge

02. Demonstrates the effectiveness of tracklet-based visual Transformers

03. Validates the superiority of the proposed approach through experiments

Abstract

Video Visual Relation Detection (VidVRD) has received significant attention from our community in recent years. In this paper, we apply the state-of-the-art video object tracklet detection pipeline, MEGA and deepSORT, to generate tracklet proposals. We then perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder that performs feature interactions between the tracklets and learnable predicate query embeddings, and finally predicts the relations. Experimental results demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge in ACM Multimedia 2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.
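
The interaction between tracklets and predicate queries described in the abstract can be illustrated with plain scaled dot-product cross-attention. The following is a minimal sketch, not the authors' implementation: it uses a single attention head, omits the learned projection matrices, and stands in random vectors for both the learnable predicate query embeddings and the tracklet features. The names `cross_attention`, `predicate_queries`, and `tracklet_feats`, and all sizes, are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tracklet_feats):
    """Single-head scaled dot-product attention.

    Each query attends over all tracklet features and returns a
    weighted average of them. Learned Q/K/V projections are omitted.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every tracklet feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in tracklet_feats]
        w = softmax(scores)
        # Weighted sum of tracklet features, one value per dimension.
        out = [sum(w_j * feat[d] for w_j, feat in zip(w, tracklet_feats))
               for d in range(dim)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical sizes: 4 predicate queries, 6 tracklets, 8-dim features.
random.seed(0)
DIM, NUM_QUERIES, NUM_TRACKLETS = 8, 4, 6
predicate_queries = [[random.gauss(0, 1) for _ in range(DIM)]
                     for _ in range(NUM_QUERIES)]
tracklet_feats = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_TRACKLETS)]

attended, attn = cross_attention(predicate_queries, tracklet_feats)
```

In a trained model the query embeddings would be parameters updated by gradient descent, and the attended outputs would feed a classifier over predicate categories; neither step is shown here.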

Code & Models

Repositories

dawn-lx/vidvrd-tracklets (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing · Softmax · Byte Pair Encoding