TL;DR
NEWTON is an agentic planning framework for physically grounded video generation: a planner orchestrates physics-aware tools and a verifier checks the result iteratively, raising joint accuracy on LTX-Video from 21.4% to 29.7% without modifying the underlying generator.
Contribution
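The paper proposes a planning-based approach with a verifier for physically grounded video generation. It targets the specification bottleneck: text prompts omit the physical parameters that determine dynamics, so the generator is never told what it would need to get the physics right.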
Findings
Joint accuracy on LTX-Video improves from 21.4% to 29.7%.
Accuracy on Veo-3.1 improves from 30.7% to 37.4%.
Both gains are obtained without modifying the underlying generator.
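To put the two reported gains on a common footing, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each generator. The four accuracy figures come from the findings above; the function name `gain` and the dictionary layout are illustrative only.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 1), round(relative, 1)


# Accuracy before and after applying NEWTON, as reported in the findings.
reported = {
    "LTX-Video": (21.4, 29.7),
    "Veo-3.1": (30.7, 37.4),
}

for model, (before, after) in reported.items():
    points, relative = gain(before, after)
    print(f"{model}: +{points} points ({relative}% relative)")
# LTX-Video: +8.3 points (38.8% relative)
# Veo-3.1: +6.7 points (21.8% relative)
```

The absolute gains are similar in size, but the relative gain is larger on LTX-Video because it starts from a lower baseline.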
Abstract
Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for…
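The plan, act, verify loop that the abstract describes can be pictured roughly as follows. This is a toy sketch under stated assumptions, not the authors' implementation: every function and class name here is hypothetical, the tools return placeholder strings, the "planner" is a fixed rule rather than a learned model, and generation is called directly rather than being chosen as one action among others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Conditioning:
    """Everything handed to the generator: the prompt plus tool outputs."""
    prompt: str
    extras: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-ins for the three tool families named in the abstract.
def keyframe_generation(c: Conditioning) -> str:
    return "placeholder keyframes"


def scientific_computation(c: Conditioning) -> str:
    return "placeholder trajectory"


def prompt_refinement(c: Conditioning) -> str:
    return c.prompt + " (physical parameters made explicit)"


TOOLS: Dict[str, Callable[[Conditioning], str]] = {
    "keyframe_generation": keyframe_generation,
    "scientific_computation": scientific_computation,
    "prompt_refinement": prompt_refinement,
}


def generate_video(c: Conditioning) -> str:
    """Placeholder generator: the real system would call a video model here."""
    return f"video[{c.prompt} | {sorted(c.extras)}]"


def verify(video: str, c: Conditioning) -> Optional[str]:
    """Toy verifier: name the first missing conditioning source, or None if OK."""
    for name in TOOLS:
        if name not in c.extras:
            return name
    return None


def plan_next_tool(feedback: str) -> str:
    """Rule-based stand-in for the learned planner: follow verifier feedback."""
    return feedback


def run(prompt: str, max_rounds: int = 5) -> Tuple[str, int]:
    """Generate, verify, and enrich the conditioning until the verifier passes."""
    c = Conditioning(prompt)
    video = generate_video(c)
    for round_idx in range(1, max_rounds + 1):
        feedback = verify(video, c)
        if feedback is None:
            return video, round_idx
        tool = plan_next_tool(feedback)
        c.extras[tool] = TOOLS[tool](c)   # act: add one piece of conditioning
        video = generate_video(c)         # regenerate with richer conditioning
    return video, max_rounds              # budget exhausted, video unverified


video, rounds = run("a ball rolls down a ramp")
print(rounds)
```

The point of the sketch is the control flow: the generator is never modified, and each round only changes the conditioning it receives.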
