TL;DR
PhyMAGIC is a training-free framework that generates physically consistent 3D motion from a single image by integrating diffusion models, LLM-guided reasoning, and physics simulation, improving realism and physical accuracy.
Contribution
It introduces a novel, training-free approach combining diffusion models, LLM reasoning, and physics simulation for physically plausible motion generation from a single image.
Findings
Outperforms state-of-the-art video generators in physical consistency.
Enhances physical property inference and motion-text alignment.
Maintains high visual fidelity in generated content.
Abstract
Recent advances in 3D content generation have amplified demand for dynamic models that are both visually realistic and physically consistent. However, state-of-the-art video diffusion models frequently produce implausible results such as momentum violations and object interpenetrations. Existing physics-aware approaches often rely on task-specific fine-tuning or supervised data, which limits their scalability and applicability. To address this challenge, we present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator to produce 3D assets ready for downstream physical simulation without fine-tuning or manual supervision. By iteratively refining motion prompts using LLM-derived confidence scores and…
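The closed loop the abstract describes (generate a video, let an LLM estimate physical properties with a confidence score, refine the motion prompt, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Estimate`, `refine_loop`, and the three callables are hypothetical, and the confidence-threshold stopping rule and round limit are assumptions made for illustration. The toy stand-ins replace the real diffusion model and LLM so the sketch runs on its own, and the final hand-off to a physics simulator is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Estimate:
    """Hypothetical container for LLM-inferred physical properties."""
    properties: Dict[str, float]
    confidence: float  # assumed to lie in [0, 1]


def refine_loop(image, prompt, generate_video, infer_properties, refine_prompt,
                threshold: float = 0.8, max_rounds: int = 5):
    """Probe the scene with new motion prompts until the estimate is confident.

    The threshold and round limit are assumed stopping rules, not values
    taken from the paper.
    """
    history: List[Tuple[str, Estimate]] = []
    best = None
    for _ in range(max_rounds):
        video = generate_video(image, prompt)        # image-to-video model
        estimate = infer_properties(video, history)  # LLM reasoning step
        history.append((prompt, estimate))
        if best is None or estimate.confidence > best.confidence:
            best = estimate
        if estimate.confidence >= threshold:
            break
        prompt = refine_prompt(prompt, estimate)     # ask for a new probing motion
    return best, history


# Toy stand-ins so the sketch runs end to end; real models would replace them.
def toy_generate_video(image, prompt):
    return f"video({image}, {prompt})"


def toy_infer_properties(video, history):
    # Pretend each extra observation raises confidence by 0.3.
    confidence = min(1.0, 0.3 * (len(history) + 1))
    return Estimate({"density": 1000.0}, confidence)


def toy_refine_prompt(prompt, estimate):
    return prompt + " [probe again]"


best, history = refine_loop("scene.png", "the ball drops", toy_generate_video,
                            toy_infer_properties, toy_refine_prompt)
print(len(history), round(best.confidence, 2))
```

With these toy stand-ins the loop stops after three rounds, once the simulated confidence passes the assumed 0.8 threshold. In a real pipeline the returned estimate would then parameterize the physics simulation.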
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
PhyMAGIC is a training-free framework generating physically consistent motion from a single image. It uses confidence-guided LLM reasoning in a closed loop, achieving superior plausibility, motion-text alignment, and over 10× speedup compared to trained baselines.
The primary limitation is that inferring physical properties from a single static image is fundamentally under-constrained. This necessitates a closed-loop iterative reasoning process to resolve ambiguities and achieve high accuracy. The framework also relies on external models, some of which require high computational resources.
1. The framework's strength is its training-free nature. It orchestrates existing, powerful foundation models (I2V, LLM) without requiring costly fine-tuning or massive annotated physics datasets, making it scalable and efficient. 2. The core methodological innovation is the closed-loop mechanism. The system intelligently "probes" the scene with different motions to resolve ambiguity, leading to progressively more accurate physical property inference. 3. It outperforms both standard video genera
1. While training-free, the iterative inference process is computationally intensive and slow, requiring multiple calls to both a VDM and an LLM, followed by a physics simulation. It would be better to compare the inference time. 2. The entire system's performance is fundamentally bottlenecked by the capabilities of its components. A failure in the VDM's ability to follow a prompt, or a flaw in the LLM's physical reasoning, will directly degrade or break the entire pipeline. More analysis of th
1. The writing is clear. 2. The integration of LLMs, generation models, and physics simulators is technically compelling. 3. The problem addressed is interesting and meaningful.
1. What is the ultimate goal of this work — generating 3D or generating video? My understanding is that rendering videos from 3D Gaussians is the final objective, with the video primarily serving as a source of physical information. In that case, is a video generation model truly necessary? Could the physical quantities instead be provided directly by an LLM’s prior knowledge or manually specified by humans? Would this affect the final results? Overall, the connection between video generation an
1. The paper introduces a conceptually elegant and practically valuable approach that unifies pretrained video diffusion models, large language model reasoning, and differentiable physics simulation into a closed-loop pipeline. This design enables physical property inference and motion synthesis from a single static image without any task-specific fine-tuning or annotated data, which is a notable advance in the direction of scalable physics-aware generation. 2. The authors provide clear physical
1. The paper does not include supplementary videos or visual demonstrations, which are crucial for evaluating a method that claims improvements in physical realism and motion plausibility. From the static figures alone, it is difficult to fully assess the perceptual quality and physical consistency of the generated dynamics. In particular, Figure 5 lacks ground-truth visualizations, making it hard to judge whether the proposed approach indeed performs better than the baselines in practice. The a
Taxonomy
Methods: Diffusion
