In Defense of Clip-based Video Relation Detection

Meng Wei; Long Chen; Wei Ji; Xiaoyu Yue; Roger Zimmermann

arXiv:2307.08984·cs.CV·July 19, 2023

TL;DR

This paper argues for the clip-based approach to video relation detection, showing that a hierarchical model with stronger spatial and temporal context modeling outperforms video-tubelet methods and achieves state-of-the-art results.

Contribution

The paper introduces a Hierarchical Context Model that improves clip-based video visual relation detection (VidVRD) by modeling spatial and temporal context more effectively, surpassing existing video-tubelet methods.

Findings

01. Clip-based methods can outperform video tubelet approaches.

02. Hierarchical Context Model achieves state-of-the-art results.

03. Enhanced context modeling improves relation detection accuracy.

Abstract

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a…
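The bottom-up paradigm described in the abstract ends with a merge step that joins clip-level relation predictions into longer video-level relations. The following is a minimal sketch of one such merge, not the paper's implementation. It assumes that each clip-level prediction already carries subject and object track identifiers, and that two predictions belong to the same relation when they share a triplet, share both tracks, and overlap or touch in time. The names `ClipRelation`, `merge_clip_relations`, and `max_gap` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ClipRelation:
    """One relation prediction over a frame span (hypothetical structure)."""
    triplet: Tuple[str, str, str]  # (subject, predicate, object)
    subject_track: int             # assumed identifier of the subject tubelet
    object_track: int              # assumed identifier of the object tubelet
    start: int                     # first frame, inclusive
    end: int                       # last frame, exclusive
    score: float                   # classifier confidence for this clip


def merge_clip_relations(
    clip_relations: List[ClipRelation], max_gap: int = 0
) -> List[ClipRelation]:
    """Greedily merge clip-level relations into video-level relations.

    Predictions with the same triplet and the same track pair are joined
    when the next one starts no later than `max_gap` frames after the
    current merged span ends. The merged score is the mean clip score.
    """
    merged: List[dict] = []
    latest: Dict[tuple, dict] = {}  # most recent merged span per key

    for rel in sorted(clip_relations, key=lambda r: r.start):
        key = (rel.triplet, rel.subject_track, rel.object_track)
        current = latest.get(key)
        if current is not None and rel.start <= current["end"] + max_gap:
            # Extend the existing span and remember the clip score.
            current["end"] = max(current["end"], rel.end)
            current["scores"].append(rel.score)
        else:
            # Start a new video-level relation for this key.
            current = {
                "key": key,
                "start": rel.start,
                "end": rel.end,
                "scores": [rel.score],
            }
            latest[key] = current
            merged.append(current)

    return [
        ClipRelation(
            triplet=m["key"][0],
            subject_track=m["key"][1],
            object_track=m["key"][2],
            start=m["start"],
            end=m["end"],
            score=sum(m["scores"]) / len(m["scores"]),
        )
        for m in merged
    ]
```

Matching on track identifiers keeps the sketch short. A real pipeline would more likely associate tubelets across clips by spatial overlap, and could aggregate scores differently (for example, taking the maximum instead of the mean).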

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training