STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal, Haidi Azaman, Cheston Tan

TL;DR
STUPD is a large-scale synthetic video dataset designed to improve models' understanding of static and dynamic spatial relations, as well as temporal relations, between objects, which helps bridge visual and language understanding.
Contribution
The paper introduces STUPD, a novel synthetic dataset of 150K videos and images that capture spatial and temporal relations and include 3D object information, intended to enhance visual relationship reasoning.
Findings
Pretraining on STUPD improves model performance on real-world visual reasoning datasets.
STUPD covers 30 spatial and 10 temporal relation types.
Synthetic data improves spatial and temporal reasoning in vision models.
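To make the idea of a prepositional relation concrete, the following is a minimal sketch, not taken from the paper, of how a relation between two objects might be represented as a (subject, preposition, object) triple and tagged as static or dynamic. The preposition sets, the `Relation` class, and the `preposition_from_predicate` helper are all hypothetical names introduced for illustration; the sets are ordinary English prepositions and are not STUPD's actual label list. The helper shows one naive way to recover a preposition from a verb-prefixed predicate of the kind found in ImageNet-VidVRD (for example swim_behind).

```python
from dataclasses import dataclass

# Hypothetical groupings for illustration only; NOT STUPD's actual labels.
STATIC_PREPOSITIONS = {"above", "behind", "beside"}
DYNAMIC_PREPOSITIONS = {"into", "onto", "through"}


@dataclass(frozen=True)
class Relation:
    """A (subject, preposition, object) triple describing two objects."""
    subject: str
    preposition: str
    obj: str

    @property
    def is_dynamic(self) -> bool:
        # A relation counts as dynamic here if its preposition implies motion.
        return self.preposition in DYNAMIC_PREPOSITIONS


def preposition_from_predicate(predicate: str) -> str:
    """Return the text after the last underscore of a predicate.

    Assumes a simple verb_preposition form such as "swim_behind".
    Multi-word prepositions (e.g. "next_to") are not handled.
    """
    return predicate.rsplit("_", 1)[-1]


if __name__ == "__main__":
    rel = Relation("ball", preposition_from_predicate("swim_behind"), "box")
    print(rel, rel.is_dynamic)
```

A mapping this simple would likely need a hand-written lookup table in practice, since predicate vocabularies differ between datasets.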
Abstract
Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step in order to bridge visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In…
Peer Reviews
Decision: Submitted to ICLR 2024
Strengths:
1. Addresses a gap in existing datasets by including a wider variety of prepositions and introducing dynamic prepositions.
2. Provides a synthetic dataset with both spatial and temporal relations, which is crucial for a more holistic understanding of visual reasoning.
3. Demonstrates the real-world applicability of the dataset through pre-training improvements on visual reasoning tasks.
Weaknesses:
1. The synthetic data may not fully capture the complexity of real-world scenarios.
2. The paper could benefit from a more extensive validation of the dataset's efficacy across a broader range of models and tasks.
The main motivation for creating the dataset, which focuses on both spatial and temporal reasoning, is clear, and the dataset comparison table (Table 1) helps readers understand the landscape of the field. Applying the dataset to two real-world spatial tasks with multiple baseline models also demonstrates its effectiveness well. Lastly, the visualization of the dataset helps the reader understand what the dataset looks like.
Although this dataset (partially) focuses on temporal relationships, it is hard to understand why such a categorization (in Figure 2) is valid. Also, I failed to find any experiment employing the STUPD dataset to improve performance on real-world *temporal* relationship tasks. Apart from the main content, this manuscript may violate Sections 2 and 4.1 of the ICLR 2024 author guidelines.
The proposed dataset STUPD makes the following contributions to relation reasoning:
- It covers diverse spatial and temporal relations: 30 spatial prepositions and 10 temporal prepositions.
- It elaborates on the spatial relations that intrinsically involve motion.
- It is a large-scale dataset.
Experiments show that pretraining on STUPD increases performance on real-world visual reasoning tasks.
1. Details are missing about the evaluation of STUPD pre-training. From the supplementary Sec. A.8.2, the SpatialSense/ImageNet-VidVRD experiment only covers 6/10 spatial relations.
- I have not found details about which relations the experiments were conducted on. ImageNet-VidVRD defines 132 predicates. Some of them are "static" but not "dynamic" spatial relations. Also, the spatial relations are connected with verbs, e.g., swim_behind and fly_behind. It is not clear how to handle the different defin
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
