In Defense of Clip-based Video Relation Detection
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

TL;DR
This paper argues for the clip-based approach to video relation detection, showing that a hierarchical model with stronger spatial and temporal context modeling outperforms video tubelet methods and achieves state-of-the-art results.
Contribution
The paper introduces a Hierarchical Context Model that significantly improves clip-based VidVRD by better modeling spatial and temporal contexts, surpassing existing video tubelet methods.
Findings
Clip-based methods can outperform video tubelet approaches.
The Hierarchical Context Model achieves state-of-the-art results.
Enhanced context modeling improves relation detection accuracy.
Abstract
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a…
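
The bottom-up pipeline described in the abstract (classify relations on short clip tubelet pairs, then merge them into long video relations) can be illustrated with a minimal sketch. This is not the paper's implementation: the `ClipRelation` record, the `merge_clip_relations` function, the `max_gap` parameter, the track ids, and the frame numbers are all hypothetical, and the greedy temporal merge is only one simple way such a merging step might work.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass
class ClipRelation:
    """One relation predicted on a short clip (all field names are hypothetical)."""
    subject_id: int                 # assumed track id of the subject tubelet
    object_id: int                  # assumed track id of the object tubelet
    triplet: Tuple[str, str, str]   # (subject class, predicate, object class)
    start: int                      # first frame of the clip (inclusive)
    end: int                        # last frame of the clip (exclusive)
    score: float                    # classifier confidence for this clip


def merge_clip_relations(clip_relations: List[ClipRelation],
                         max_gap: int = 0) -> List[ClipRelation]:
    """Greedily merge clip-level relations into longer video-level relations.

    Two predictions are merged when they share the same tubelet pair and
    triplet, and the later one starts no more than `max_gap` frames after
    the current segment ends. The merged score is the mean of clip scores.
    """
    ordered = sorted(
        clip_relations,
        key=lambda r: (r.subject_id, r.object_id, r.triplet, r.start),
    )
    merged: List[ClipRelation] = []
    scores: List[float] = []  # clip scores of the segment being built
    for rel in ordered:
        last = merged[-1] if merged else None
        same_relation = (
            last is not None
            and (last.subject_id, last.object_id, last.triplet)
            == (rel.subject_id, rel.object_id, rel.triplet)
            and rel.start <= last.end + max_gap
        )
        if same_relation:
            # Extend the current video-level relation in time.
            last.end = max(last.end, rel.end)
            scores.append(rel.score)
            last.score = sum(scores) / len(scores)
        else:
            # Start a new video-level relation from a copy of this clip.
            scores = [rel.score]
            merged.append(replace(rel))
    return merged


# Illustrative input: overlapping 30-frame clips (made-up values).
clips = [
    ClipRelation(0, 1, ("dog", "chase", "frisbee"), 0, 30, 0.75),
    ClipRelation(0, 1, ("dog", "chase", "frisbee"), 15, 45, 0.5),
    ClipRelation(0, 1, ("dog", "chase", "frisbee"), 90, 120, 0.25),
    ClipRelation(0, 1, ("dog", "bite", "frisbee"), 30, 60, 0.5),
]
merged = merge_clip_relations(clips)
for rel in merged:
    print(rel.triplet, rel.start, rel.end, rel.score)
```

In this sketch the two overlapping "chase" clips collapse into one relation spanning both, while the later "chase" clip stays separate because it does not touch the first segment in time. A top-down method would skip this merging step and classify the long tubelet pair directly.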
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
