TL;DR
This paper explores how pretraining on video data with diffusion models enhances visual understanding and adaptability, showing superior data efficiency over language models in various tasks.
Contribution
It introduces the use of video diffusion models for pretraining, demonstrating their potential as visual foundation models with strong inductive biases.
Findings
Video diffusion models (VDMs) outperform large language models (LLMs) in data efficiency across multiple benchmarks.
Video pretraining provides beneficial inductive biases for visual tasks.
VDMs show promise for broad visual problem-solving capabilities.
Abstract
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC,…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The motivation of this work is good; the questions the authors raise deserve dedicated study. 2. The curated tasks are interesting; the authors designed synthetic tasks to test these questions. 3. They show some results suggesting that pretraining on VDM modality-specific tasks improves downstream performance.
1. The synthetic tasks are too simplified to be indicative of performance on downstream or other tasks. If the authors could show a downstream application improvement, even just one example, the work would be more convincing. 2. The authors compare an LLM and a VDM, which are two different architectures. There may be transformer-based video LLMs available, such as VideoPoet and VAR, among others. I am sure there are also open-source alternatives that are more suitable for these comparisons. 3. The ablat
- Novel Hypothesis and Reframing: The paper's core strength is its originality in reframing VDMs as general problem-solvers rather than just generators. The hypothesis that spatiotemporal inductive biases are key to visual intelligence is a significant and insightful contribution. - Focus on Data Efficiency: The evaluation wisely focuses on skill acquisition efficiency instead of just final SOTA performance. This provides much deeper evidence for the VDM's superior learning properties in low-dat
1. Fundamental Asymmetry in Task Representation and Modality: The comparison's fairness is highly questionable due to a core mismatch in task modalities. (1) The LLM must perform a text-to-text translation on JSON-serialized grids, while the VDM performs a direct pixel-to-pixel mapping. These two representations have fundamentally different information densities, processing complexities, and inherent difficulties. (For example, given a 5x5 grid structure, VDM needs to process an image of 256x2
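To make the representation mismatch raised in this review concrete, here is a minimal sketch of the two views of one small grid: a JSON string, as a text model would read it, and an upscaled pixel array, as an image or video model would. This is an illustration only, not the paper's pipeline. The function names, the example grid, and the cell size of 8 pixels are all assumptions chosen for the example; the actual image resolution is not recoverable from the truncated review text.

```python
import json


def serialize_grid(grid):
    """Text view: the grid as a compact JSON string (hypothetical helper)."""
    return json.dumps(grid, separators=(",", ":"))


def render_grid(grid, cell_px):
    """Pixel view: expand each cell into a cell_px x cell_px block of its value."""
    image = []
    for row in grid:
        # Repeat each cell value horizontally, then repeat the row vertically.
        pixel_row = [value for value in row for _ in range(cell_px)]
        image.extend(list(pixel_row) for _ in range(cell_px))
    return image


# A 5x5 grid of small integer "colors" (made-up example data).
grid = [[(r + c) % 3 for c in range(5)] for r in range(5)]

text = serialize_grid(grid)
image = render_grid(grid, cell_px=8)  # cell size is an assumption

print("characters in text view:", len(text))
print("pixels in image view:", len(image) * len(image[0]))
```

Both views carry the same 25 cell values, but the number of input elements each model has to process differs by a wide margin, which is the asymmetry the reviewer is pointing at.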
1. The authors have curated a set of interesting visual tasks to benchmark the spatial reasoning capacity of VDMs and LLMs from cellular automata to visual games. 2. The ARC-AGI results are quite novel and timely and highlight drawbacks of current LLMs.
1. It seems all the visual tasks require spatial reasoning, not spatio-temporal reasoning, which raises the question of why image diffusion models are not evaluated alongside video diffusion models. The authors practically discard the temporally intermediate frames generated by the model, so the temporal reasoning capacity of these models is essentially neither used nor evaluated. 2. The data efficiency plots compare cog-x with qwen without controlling for pre-training FLOPs/data-volume. This is a very
