End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

Yongqi Wang; Xinxiao Wu; Shuo Yang; Jiebo Luo

arXiv:2409.12499·cs.CV·April 15, 2025

TL;DR

This paper introduces an end-to-end open-vocabulary video visual relationship detection framework that unifies object trajectory detection and relationship classification, leveraging multi-modal prompting and CLIP for improved generalization to unseen categories.

Contribution

The paper proposes a unified framework that combines a relationship-aware trajectory detector with multi-modal prompting, with the aim of improving open-vocabulary relationship detection in videos over existing methods.

Findings

01. Outperforms existing methods on the VidVRD and VidOR datasets

02. Demonstrates strong generalization in cross-dataset scenarios

03. Effectively detects unseen object relationships in videos

Abstract

Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based…
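The open-vocabulary classification step mentioned in the abstract (matching visual features against a vision-language model's text embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional vectors, the category names "ride", "chase", and "feed", and the temperature value are all assumptions made for illustration. In a real system the embeddings would come from a pre-trained model such as CLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_vocab(visual_feature, text_embeddings, temperature=0.07):
    """Score a visual feature against named text embeddings.

    text_embeddings maps a category name to its embedding. Because the
    categories are supplied at call time, unseen names can be added
    without retraining. The temperature is an assumed value.
    """
    v = l2_normalize(visual_feature)
    names = list(text_embeddings)
    logits = []
    for name in names:
        t = l2_normalize(text_embeddings[name])
        cosine = sum(a * b for a, b in zip(v, t))
        logits.append(cosine / temperature)
    return dict(zip(names, softmax(logits)))


# Hypothetical toy data standing in for model-produced embeddings.
toy_text_embeddings = {
    "ride": [0.9, 0.1, 0.0],
    "chase": [0.1, 0.9, 0.1],
    "feed": [0.0, 0.2, 0.9],
}
toy_visual_feature = [0.2, 0.8, 0.1]

scores = classify_open_vocab(toy_visual_feature, toy_text_embeddings)
best = max(scores, key=scores.get)
print(best)
```

The category set is just a dictionary passed at inference time, which is what allows classification beyond the annotated categories. The trajectory detection stage that produces the visual feature is not modeled here.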

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Residual Connection