$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

Chia-Hsiang Kao; Cong Phuoc Huynh; Chien-Yi Wang; Noranart Vesdapunt; Stefan Stojanov; Bharath Hariharan; Oleksandr Obiednikov; Ning Zhou

arXiv:2605.20576·cs.CV·May 21, 2026

TL;DR

$\Delta$ynamics introduces a language-based framework that infers rigid-body dynamics from videos by generating scene descriptions in text, enabling better generalization and transfer to real-world data.

Contribution

The paper presents $\Delta$ynamics, a novel vision-language approach that uses structured text to represent physics scenes, improving generalization and transfer in rigid-body dynamics inference.

Findings

01

Achieves 0.30 segmentation IoU on CLEVRER, 7x better than top VLMs.

02

Test-time sampling improves IoU by 27%.

03

Evolutionary search boosts IoU by 120%.

Abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $\Delta$ynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $\Delta$ynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $\Delta$ynamics achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally,…
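The headline metric above is segmentation IoU (intersection over union). As a reminder of what that number measures, here is a minimal sketch of IoU between two binary masks. It is illustrative only: the function name `mask_iou`, the nested-list mask representation, and the convention for two empty masks are assumptions made here, and this summary does not say how the paper aggregates IoU across objects or frames.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-shaped binary masks."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on, t_on = bool(p), bool(t)
            intersection += p_on and t_on  # pixel set in both masks
            union += p_on or t_on          # pixel set in at least one mask

    # Assumption: two empty masks count as a perfect match (IoU = 1.0).
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```

Read this way, an IoU of 0.30 means that the overlap between predicted and reference masks covers 30% of their combined area.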
