TL;DR
PhysVidBench is a new benchmark that evaluates the physical commonsense reasoning of text-to-video models through curated prompts and a three-stage evaluation pipeline, addressing a key gap in current video generation quality.
Contribution
This paper introduces PhysVidBench, the first comprehensive benchmark for assessing physical reasoning in text-to-video models, combining prompt curation and a novel evaluation methodology.
Findings
State-of-the-art models often fail basic physical reasoning tasks.
PhysVidBench reveals specific weaknesses in current T2V models.
The evaluation pipeline effectively measures physical commonsense understanding.
Abstract
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts, emphasizing tool use, material properties, and procedural interactions, and domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language…
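The three-stage pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract is cut off before stage 3 is fully described, so the sketch relies on the reviews below, which mention an LLM judge and questions whose ground truth is always "Yes". It assumes the judge answers each question from the caption alone and that a video's score is the fraction of "yes" answers. `Question`, `score_video`, and `toy_judge` are hypothetical names, and the keyword matcher is only a stand-in for a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """A physics question derived from the prompt (stage 1)."""
    text: str
    # Hypothetical field used only by the toy judge below.
    keywords: Tuple[str, ...]


def score_video(
    questions: Sequence[Question],
    caption: str,
    judge: Callable[[Question, str], bool],
) -> float:
    """Assumed scoring rule: fraction of questions judged 'yes' from the caption."""
    if not questions:
        return 0.0
    yes_count = sum(1 for q in questions if judge(q, caption))
    return yes_count / len(questions)


def toy_judge(question: Question, caption: str) -> bool:
    """Stand-in for the LLM judge (stage 3): 'yes' if every keyword appears."""
    text = caption.lower()
    return all(keyword in text for keyword in question.keywords)


# Stage 1 output: questions whose expected answer is "yes".
questions = [
    Question("Does the knife cut into the bread?", ("knife", "cut")),
    Question("Does the bread separate into slices?", ("slices",)),
]

# Stage 2 output: a caption of the generated video (a made-up example).
caption = "A hand holds a knife and cuts a loaf of bread on a board."

score = score_video(questions, caption, toy_judge)
print(score)
```

Passing the judge in as a callable keeps perception (the caption) separate from reasoning (the judge), so either component can be swapped without touching the scoring logic.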
Peer Reviews
Decision·Submitted to ICLR 2026
1. The most significant strength is the novel evaluation pipeline. By decoupling perception (VLM dense captioning) from reasoning (LLM-as-a-judge), the method cleverly avoids the hallucination and "response collapse" pitfalls common in direct video-QA.
2. The benchmark itself is well-designed and highly focused.
3. The benchmark offers strong diagnostic utility beyond simple scoring. It provides a fine-grained breakdown of model failures across seven distinct physical dimensions (Table 2).
1. The evaluation is entirely dependent on the AuroraCap VLM. If the captioner fails to observe a correctly generated physical detail, the T2V model is unfairly penalized for a failure in the evaluation pipeline itself.
2. The benchmark's questions are all designed to have "Yes" as the ground truth. This only tests a model's ability to produce a correct phenomenon and fails to test if it can avoid producing a physically implausible one.
3. All 383 prompts are adapted from the PIQA dataset. W
1. Clear motivation and real-world relevance. The paper targets an underexplored yet essential capability: everyday physical commonsense. Unlike prior work focusing on abstract physics laws or motion smoothness, PhysVidBench tests realistic, goal-oriented interactions involving tool use and material behavior.
2. Comprehensive and systematic evaluation. The benchmark spans seven reasoning dimensions (force, motion, affordance, material transformation, etc.) and incorporates a difficulty-based stra
1. Evaluation of realism and ceiling. Although the caption-based QA pipeline reduces hallucination, it evaluates textual rather than visual physical understanding. The final judgment depends on a single captioner's recall and on Gemini's internal physics priors, and a single model may introduce deviations. The pipeline measures consistency within the text–caption–QA chain rather than direct perception and reasoning, which may introduce biases.
2. Lack of statistical significance reporting. Performance differences (e.g.,
1. The evaluation methodology is relatively novel and interesting, offering a new perspective on assessing the physical understanding of video generation models.
2. The division of evaluation into seven commonsense dimensions appears comprehensive and covers a wide range of physical reasoning aspects.
3. The proposed iterative, error-guided prompt refinement approach for video generation is potentially inspiring for future research on video generation agents.
1. The presentation of the paper needs improvement. For example, in Figure 3, the text in Stage 4 is incomplete and partially covered. The purpose of the classification in Figure 1 into Real-World Videos and Synthetically Generated Videos is unclear. Also, Figure 4 looks more like a presentation slide than a figure suitable for an academic paper.
2. I have some concerns about the details of the evaluation process. In evaluation Step 1, the LLM generates ground-truth "yes" questions purely base
