3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism
Bhavik Chandna, Kelsey R. Allen

TL;DR
3DSPA is an automated, reference-free evaluation framework for video realism that combines 3D trajectories, depth, and semantic features to assess physical plausibility and temporal consistency, aligning well with human judgments.
Contribution
The paper introduces 3DSPA, a novel 3D spatiotemporal point autoencoder that integrates 3D point trajectories, depth cues, and DINO semantic features to evaluate video realism without reference videos.
Findings
3DSPA effectively detects physical law violations in videos.
It is more sensitive to motion artifacts than existing methods.
It aligns closely with human judgments of video quality.
Abstract
AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show…
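The abstract describes combining 3D point trajectories, depth cues, and semantic features into one representation. As a minimal sketch of what such an input could look like (not the authors' implementation), the snippet below lifts a 2D point track into 3D with a per-frame depth map and a pinhole camera model, then appends a semantic feature vector to each 3D position. The camera intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-pixel depth lookup, and the helper names `unproject`, `lift_track`, and `build_point_tokens` are all assumptions made for illustration; the source does not specify how 3DSPA constructs its inputs.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with a given depth into camera coordinates.

    Assumes a pinhole camera; the intrinsics are hypothetical inputs.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_track(track_2d: Sequence[Tuple[float, float]],
               depth_maps: Sequence[Sequence[Sequence[float]]],
               fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Lift one 2D track (one (u, v) per frame) to 3D.

    Depth is read from the nearest pixel of the matching frame's depth map,
    with the lookup index clamped to the image bounds.
    """
    points: List[Point3D] = []
    for t, (u, v) in enumerate(track_2d):
        depth_map = depth_maps[t]
        height, width = len(depth_map), len(depth_map[0])
        row = min(max(int(round(v)), 0), height - 1)
        col = min(max(int(round(u)), 0), width - 1)
        points.append(unproject(u, v, depth_map[row][col], fx, fy, cx, cy))
    return points


def build_point_tokens(track_3d: Sequence[Point3D],
                       semantic_feature: Sequence[float]) -> List[List[float]]:
    """Concatenate each 3D position with a per-track semantic feature vector.

    The feature stands in for something like a DINO descriptor (assumption).
    """
    return [list(point) + list(semantic_feature) for point in track_3d]


if __name__ == "__main__":
    # Two 4x4 frames with constant depth 2.0 and toy intrinsics.
    depth = [[[2.0] * 4 for _ in range(4)] for _ in range(2)]
    track = [(2.0, 2.0), (3.0, 2.0)]
    points_3d = lift_track(track, depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
    print(build_point_tokens(points_3d, [0.5, 0.1]))
```

In this toy setup the point moves one pixel to the right at constant depth, so its lifted position shifts along the camera x-axis only. A real pipeline would take tracks and depth from learned estimators, and errors in either would carry over into the 3D points.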
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
1. Paper is well written and organized. 2. Novel 3D semantic-physical representation: Proposes a 3D point autoencoder that jointly encodes geometric motion and semantic features, allowing more comprehensive realism assessment. 3. Strong empirical validation and interpretability: Experiments validate that 3DSPA can reconstruct 3D point tracks and detect physical rule violations. Extensive results show consistent superiority over previous evaluators.
1. Reliance on additional models: The 3D points are obtained with CoTracker3 and Video Depth Anything (VDA). These models are strong in point tracking and metric depth estimation, yet they still have limitations; for example, CoTracker3 may fail on small objects in the video, and VDA may have trouble with sharp lines. Dramatic motions or camera movements also remain problematic for geometric models. Will these possible artifacts or errors in geometric prediction affect the results of 3DSPA? Look forward to
Novel Framework: 3DSPA integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation, allowing for robust assessments of video realism, temporal consistency, and physical plausibility without requiring a reference video. Enhanced Realism Detection: It reliably identifies videos that violate physical laws, is highly sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism compared to existing methods. Super
1. Presentation weakness: Table 1 runs off the page, and the paper is less than 9 pages in length. 2. The difference between 3DSPA and TRAJAN+DINO/+3D needs more discussion. 3. Lack of ablations: There are no ablation studies in the paper or the supplement; only comparisons with TRAJAN variants are included. Ablations of the different components in 3DSPA are expected. 4. More visualization results are expected, including 3D point tracks reconstructed by 3DSPA for more videos, and more unrealistic videos which 3
1. This paper is the first to propose evaluating the realism of generated videos from a 3D perspective, which is a promising direction for future work. 2. The results show that 3DSPA works accurately on both real and generated videos. 3. The whole paper is well written and easy to understand.
1. The authors did not provide a clear rationale in the paper for the necessity of the supported point track input; its effectiveness is only demonstrated through ablation experiments. It would be better if the paper included more discussion and justification for this component. 2. The strategy for selecting the initial points is unclear. If a uniform selection strategy is used, how is it ensured that the points fall on entities with motion or fast motion (the scene/entity changes several times in a vid
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
