GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu

TL;DR
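GTASA is a multi-actor video dataset annotated with per-frame spatial relation graphs and event-level temporal mappings, released together with GEST-Engine, the system that generated it. The annotations support evaluating video generators, probing video encoders on spatiotemporal reasoning tasks, and training video captioning models.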
Contribution
The paper presents two things: GTASA, a new dataset of multi-actor videos with ground truth annotations, and GEST-Engine, the system based on Graphs of Events in Space and Time (GEST) that generates those videos. Together they support the evaluation and training of video models.
Findings
GTASA provides ground truth for physical plausibility and semantic faithfulness, which the evaluation of neural video generators otherwise lacks.
Self-supervised encoders outperform VLM encoders in spatial reasoning tasks.
GTASA improves training of video captioning models.
Abstract
Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly…
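As a rough illustration of the two annotation types named in the abstract, the sketch below models a per-frame spatial relation graph and an event-level temporal mapping as plain Python data structures. Every class name, field name, and predicate here (SpatialRelation, Event, AnnotatedVideo, "left_of", and so on) is a hypothetical choice made for illustration; the abstract does not describe GTASA's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical schema: not taken from the GTASA release.
@dataclass(frozen=True)
class SpatialRelation:
    """One edge of a per-frame spatial relation graph."""
    subject: str    # e.g. an actor identifier
    predicate: str  # e.g. "left_of", "near" (assumed vocabulary)
    obj: str        # the entity the subject is related to


@dataclass
class Event:
    """An event mapped to the frame interval in which it occurs."""
    actor: str
    action: str
    start_frame: int
    end_frame: int  # inclusive


@dataclass
class AnnotatedVideo:
    num_frames: int
    # frame index -> relations that hold in that frame
    relations: Dict[int, List[SpatialRelation]] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def relations_at(self, frame: int) -> List[SpatialRelation]:
        """Return the spatial relations annotated for one frame."""
        return self.relations.get(frame, [])

    def events_at(self, frame: int) -> List[Event]:
        """Return every event whose interval covers the given frame."""
        return [e for e in self.events
                if e.start_frame <= frame <= e.end_frame]


# Toy example with made-up actors and actions.
video = AnnotatedVideo(num_frames=100)
video.relations[10] = [SpatialRelation("actor_1", "left_of", "table")]
video.events.append(Event("actor_1", "sit_down", 5, 20))
video.events.append(Event("actor_2", "walk", 15, 40))
```

With a structure like this, a question such as "what is happening at frame 18, and where is each actor?" reduces to two lookups, `video.events_at(18)` and `video.relations_at(18)`. Exact answers of this kind are what make ground-truth annotations usable as probing and evaluation targets.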
