Described Spatial-Temporal Video Detection
Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar, Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann

TL;DR
This paper introduces a new task, described spatial-temporal video detection (DSTVD), a new dataset, DVD-ST, and baseline models that extend existing methods to detect zero, one, or multiple objects in a video in response to a language query.
Contribution
The paper advances spatial-temporal video grounding (STVG) to DSTVD, introduces the DVD-ST dataset, and develops baseline models that handle queries matching multiple entities or no entity at all.
Findings
Baseline models achieve promising results on DVD-ST.
The dataset covers over 150 diverse entities.
Extensive analysis provides insights for future research.
Abstract
Detecting visual content from language expressions has become an emerging topic in the community. However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to detect only one pre-existing object in each frame, ignoring the fact that language descriptions can involve no entity or multiple entities within a video. In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD) by overcoming the above limitation. To facilitate the exploration of DSTVD, we first introduce a new benchmark, namely DVD-ST. Notably, DVD-ST supports grounding from zero to many objects in the video in response to queries and encompasses a diverse range of over 150 entities, including appearance, actions, locations, and interactions. The extensive breadth and diversity of the DVD-ST dataset make it an exemplary…
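The difference between the two settings described in the abstract can be pictured as a difference in output type: STVG yields exactly one object track per query, while DSTVD yields a list of zero or more. The following is a minimal illustrative sketch only, not the paper's data format or model interface; the names `Box`, `Tube`, `DSTVDResult`, and `boxes_at`, the box layout, and the sample queries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """One entity tracked over time: frame index -> bounding box."""
    boxes: Dict[int, Box] = field(default_factory=dict)

    def span(self) -> Tuple[int, int]:
        """First and last frame index in which this entity is boxed."""
        return min(self.boxes), max(self.boxes)


@dataclass
class DSTVDResult:
    """Hypothetical DSTVD-style output: zero or more tubes per query.

    An STVG-style output would instead always hold exactly one Tube.
    """
    query: str
    tubes: List[Tube] = field(default_factory=list)

    def boxes_at(self, frame: int) -> List[Box]:
        """All boxes matching the query in a given frame (possibly none)."""
        return [t.boxes[frame] for t in self.tubes if frame in t.boxes]


# A query that matches nothing in the video: an empty result is valid.
absent = DSTVDResult(query="a dog jumping over a fence")

# A query that matches two entities with different temporal extents.
crowd = DSTVDResult(
    query="people waving",
    tubes=[
        Tube({0: (10.0, 20.0, 50.0, 90.0), 1: (12.0, 20.0, 52.0, 90.0)}),
        Tube({1: (100.0, 30.0, 140.0, 95.0)}),
    ],
)
```

Representing the result as a list makes the "no object" case an ordinary empty value rather than a special error path, which is the practical consequence of dropping the one-object assumption.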
Taxonomy
Topics: Digital Media Forensic Detection · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
