TL;DR
This paper benchmarks off-the-shelf generative video models for predictive display in teleoperation, revealing current limitations in real-time, low-error, short-horizon prediction without task-specific tuning.
Contribution
It introduces a zero-shot benchmarking pipeline for evaluating generative video models in teleoperation scenarios, highlighting the gap between general models and practical predictive display needs.
Findings
No model achieves low error, real-time inference, and stable predictions simultaneously.
Increasing model size or resolution yields limited gains and in some cases degrades performance.
Practical deployment requires adaptation or optimization beyond off-the-shelf models.
Abstract
Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two…
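The rollout-based formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes an autoregressive rollout in which each predicted frame is fed back into a sliding context window, and it scores each step with mean squared error. The function names, the toy predictors, and the choice of metric are all hypothetical stand-ins for illustration; they are not the paper's pipeline, models, or metrics.

```python
from typing import Callable, List, Sequence

# Assumption: a frame is a flattened list of pixel intensities.
# Real frames would be image tensors from the simulator.
Frame = List[float]
Predictor = Callable[[Sequence[Frame]], Frame]


def rollout(predict_next: Predictor, context: Sequence[Frame], horizon: int) -> List[Frame]:
    """Predict `horizon` future frames, feeding each prediction back as context."""
    window = list(context)
    predicted: List[Frame] = []
    for _ in range(horizon):
        nxt = predict_next(window)
        predicted.append(nxt)
        window = window[1:] + [nxt]  # slide the window forward by one frame
    return predicted


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def per_step_error(predicted: Sequence[Frame], target: Sequence[Frame]) -> List[float]:
    """Error at each rollout step, to expose how error grows with the horizon."""
    return [mse(p, t) for p, t in zip(predicted, target)]


def last_frame_baseline(window: Sequence[Frame]) -> Frame:
    """Hypothetical predictor: repeat the delayed frame (no prediction at all)."""
    return list(window[-1])


def linear_extrapolation(window: Sequence[Frame]) -> Frame:
    """Hypothetical predictor: extend the last per-pixel change by one step."""
    prev, last = window[-2], window[-1]
    return [2 * l - p for p, l in zip(prev, last)]


# Toy sequence: every pixel brightens by 1.0 per frame.
frames = [[float(t), float(t + 1)] for t in range(6)]
context, future = frames[:3], frames[3:]

baseline_errors = per_step_error(rollout(last_frame_baseline, context, 3), future)
extrapolation_errors = per_step_error(rollout(linear_extrapolation, context, 3), future)
print(baseline_errors)
print(extrapolation_errors)
```

On this toy sequence, repeating the delayed frame accumulates error at every step, while the extrapolating predictor tracks the motion exactly. A real evaluation would replace the toy predictors with a video model and would also time each call against the frame interval.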
