End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Yongqi Wang, Xinxiao Wu, Shuo Yang, Jiebo Luo

TL;DR
This paper introduces an end-to-end open-vocabulary video visual relationship detection framework that unifies object trajectory detection and relationship classification, leveraging multi-modal prompting and CLIP for improved generalization to unseen categories.
Contribution
The paper proposes a unified, end-to-end framework that pairs a relationship-aware open-vocabulary trajectory detector with multi-modal prompting, so that relationship detection no longer depends on a trajectory detector pre-trained on a closed set of object categories.
Findings
Outperforms existing methods on VidVRD and VidOR datasets
Demonstrates strong generalization in cross-dataset scenarios
Effectively detects unseen object relationships in videos
Abstract
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based…
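As a rough illustration of the open-vocabulary classification step the abstract refers to, the sketch below scores a visual feature against text embeddings of arbitrary category names using cosine similarity followed by a softmax, in the general style of CLIP-like models. It is not the authors' implementation: the function names, the temperature value, and the 3-dimensional toy embeddings are all assumptions made for illustration, and a real system would obtain the embeddings from a pre-trained vision-language model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def open_vocab_classify(visual_feature, text_embeddings, temperature=0.01):
    """Score one visual feature against text embeddings of category names.

    text_embeddings maps a category name to its embedding (list of floats).
    Returns a dict mapping each category name to a probability.
    The temperature value is an illustrative assumption.
    """
    v = l2_normalize(visual_feature)
    logits = {}
    for name, emb in text_embeddings.items():
        t = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(v, t))
        logits[name] = cosine / temperature

    # Numerically stable softmax over the candidate categories.
    max_logit = max(logits.values())
    exps = {name: math.exp(z - max_logit) for name, z in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Toy embeddings standing in for text-encoder outputs (hypothetical values).
text_embeddings = {
    "ride": [1.0, 0.0, 0.0],
    "chase": [0.0, 1.0, 0.0],
    "feed": [0.0, 0.0, 1.0],  # pretend this predicate was unseen during training
}

# Toy feature standing in for a subject-object trajectory pair (hypothetical).
pair_feature = [0.1, 0.2, 0.9]

probs = open_vocab_classify(pair_feature, text_embeddings)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

Because categories enter only through their text embeddings, a new relationship can be added at inference time by adding one more entry to the dictionary, which is the property that makes this style of classifier open-vocabulary.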
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Residual Connection
