$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou

TL;DR
$\Delta$ynamics introduces a language-based framework that infers rigid-body dynamics from videos by generating scene descriptions in text, enabling better generalization and transfer to real-world data.
Contribution
The paper presents $\Delta$ynamics, a novel vision-language approach that uses structured text to represent physics scenes, improving generalization and transfer in rigid-body dynamics inference.
Findings
Achieves 0.30 segmentation IoU on CLEVRER, 7x better than top VLMs.
Test-time sampling improves IoU by 27%.
Evolutionary search boosts IoU by 120%.
Abstract
Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $\Delta$ynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $\Delta$ynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $\Delta$ynamics achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally,…
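The headline metric above is segmentation IoU. As a point of reference, here is a minimal sketch of the standard intersection-over-union computation on two binary masks. It is not the paper's evaluation code: the function name `mask_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and the summary does not say how masks are matched or averaged across objects and frames.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over union of two binary masks of the same shape.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy example: two 2x2 squares on a 4x4 grid that overlap in one pixel.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True      # 4 predicted pixels
target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True    # 4 ground-truth pixels

# Intersection is 1 pixel and union is 4 + 4 - 1 = 7 pixels, so IoU = 1/7.
toy_iou = mask_iou(pred, target)
print(f"IoU = {toy_iou:.3f}")
```

In a simulation-based pipeline, `pred` would presumably come from rendering the simulated scene and `target` from the ground-truth segmentation of the video frame, with the per-frame scores then aggregated. That aggregation step is left out here because the summary does not specify it.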
