PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Yu-Wei Zhan; Xin Wang; Hong Chen; Tongtong Feng; Wei Feng; Ren Wang; Guangyao Li; Qing Li; Wenwu Zhu

arXiv:2512.04532·cs.CV·December 5, 2025

TL;DR

PhyVLLM introduces a physics-guided video-language model that disentangles motion and appearance, modeling physical dynamics with Neural ODEs to improve understanding of physical interactions in videos.

Contribution

The paper presents a novel framework that explicitly incorporates physical motion modeling into Video LLMs using disentanglement and Neural ODEs, enhancing physical reasoning capabilities.
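
As a rough illustration of the two ingredients named above, the following minimal sketch splits per-frame feature vectors into an appearance part and a motion part, then rolls a motion latent forward in continuous time with a small ODE solver. It is not the authors' implementation: the temporal-mean split, the `tanh(W z + b)` vector field, the fixed-step RK4 solver, and every function and parameter name here are assumptions chosen for illustration only.

```python
import math

def split_appearance_motion(frames):
    """Toy disentanglement (assumption): appearance = temporal mean of the
    frame features, motion = per-frame residual from that mean."""
    n = len(frames)
    appearance = [sum(col) / n for col in zip(*frames)]
    motion = [[x - a for x, a in zip(frame, appearance)] for frame in frames]
    return appearance, motion

def dynamics(z, W, b):
    """Hypothetical learned vector field f(z) = tanh(W z + b)."""
    return [math.tanh(sum(w * x for w, x in zip(row, z)) + bias)
            for row, bias in zip(W, b)]

def rk4_step(z, dt, W, b):
    """One classical Runge-Kutta (RK4) step of dz/dt = f(z)."""
    def shifted(k, scale):
        return [zi + scale * ki for zi, ki in zip(z, k)]
    k1 = dynamics(z, W, b)
    k2 = dynamics(shifted(k1, dt / 2), W, b)
    k3 = dynamics(shifted(k2, dt / 2), W, b)
    k4 = dynamics(shifted(k3, dt), W, b)
    return [zi + dt / 6 * (a + 2 * c + 2 * d + e)
            for zi, a, c, d, e in zip(z, k1, k2, k3, k4)]

def integrate(z0, t_end, n_steps, W, b):
    """Integrate the motion latent from t=0 to t_end.
    Returns the trajectory as a list of n_steps + 1 states."""
    dt = t_end / n_steps
    trajectory = [list(z0)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(trajectory[-1], dt, W, b))
    return trajectory

# Usage: two 2-D frame features, then a short rollout of the first motion latent.
appearance, motion = split_appearance_motion([[1.0, 2.0], [3.0, 4.0]])
W = [[0.0, 1.0], [-1.0, 0.0]]  # illustrative weights, not learned
b = [0.0, 0.0]
trajectory = integrate(motion[0], t_end=1.0, n_steps=10, W=W, b=b)
```

Because the state is defined by an ODE, the latent can be queried at any time between frames by changing `t_end` and `n_steps`, which is the property that makes a continuous-time formulation attractive for motion. In a trained model the weights would be learned and a library solver would typically replace the hand-written RK4 step.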

Findings

1. Outperforms state-of-the-art Video LLMs on physical reasoning tasks.
2. Effectively disentangles appearance and motion for better physical understanding.
3. Uses self-supervised learning to model continuous physical dynamics.

Abstract

Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition