TL;DR
VrdONE is a novel one-stage video visual relation detection model that efficiently captures spatiotemporal interactions between entities in videos, achieving state-of-the-art results without complex multi-step processes.
Contribution
The paper introduces VrdONE, a streamlined one-stage model with a Subject-Object Synergy module that improves relation detection across a range of temporal scales.
Findings
Achieves state-of-the-art performance on VidOR and ImageNet-VidVRD benchmarks.
Effectively captures both short-lived and long-lasting relations in videos.
Eliminates the need for proposal generation and post-processing steps.
Abstract
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one…
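To make the "1D instance segmentation" framing concrete, the sketch below shows how a per-frame binary mask for one relation can be read back as temporal boundaries. It is an illustration only and not VrdONE's implementation: the function name `mask_to_segments`, the 0.5 threshold, and the example scores are all assumptions made here.

```python
from typing import List, Tuple


def mask_to_segments(mask_scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-frame mask scores into (start, end) frame intervals.

    Hypothetical helper: `mask_scores[t]` is assumed to be the predicted
    probability that the relation holds at frame t. Intervals are
    half-open, so `end` is the first frame where the relation is inactive.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # frame index where the current active run began
    for t, score in enumerate(mask_scores):
        active = score >= threshold
        if active and start is None:
            start = t  # a new run of active frames begins
        elif not active and start is not None:
            segments.append((start, t))  # the run ended at frame t
            start = None
    if start is not None:
        # The relation is still active at the last frame.
        segments.append((start, len(mask_scores)))
    return segments


# Assumed example: one short-lived and one trailing occurrence of a relation.
example_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
example_segments = mask_to_segments(example_scores)
```

Under this reading, a single mask can describe a relation that appears briefly, lasts for most of the video, or recurs several times, so the relation category and its temporal extent can be produced together instead of in separate steps.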
