VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Hila Chefer; Uriel Singer; Amit Zohar; Yuval Kirstain; Adam Polyak; Yaniv Taigman; Lior Wolf; Shelly Sheynin

arXiv:2502.02492 · cs.CV · May 27, 2025

TL;DR

VideoJAM introduces a joint appearance-motion framework that improves motion coherence and visual quality in video generation by learning a combined representation and guiding generation with dynamic motion predictions.

Contribution

It presents a framework that couples appearance and motion learning during training with a new inference-time guidance mechanism, Inner-Guidance. The framework is intended to apply to existing video models with minimal adaptation and without changes to the training data.

Findings

01. Achieves state-of-the-art motion coherence in video generation.

02. Enhances visual quality of generated videos.

03. Applicable to various models with minimal changes.

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably,…
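The two units described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the linear heads, the squared-error losses, the classifier-free-guidance-style combination, and every name and default weight (`joint_loss`, `guided_prediction`, `w_text`, `w_motion`) are assumptions made only for illustration. The first function reads both an appearance prediction and a motion prediction from one shared representation and sums their errors. The second combines a full prediction with two ablated ones, so that the model's own motion prediction acts as an extra guidance signal.

```python
import numpy as np


def joint_loss(features, w_app, w_motion, target_app, target_motion):
    """Toy joint objective: two outputs read from ONE shared representation.

    features      : (batch, d) shared representation (hypothetical)
    w_app         : (d, k) linear head for appearance (pixels)
    w_motion      : (d, k) linear head for motion (e.g. a flow-like signal)
    """
    pred_app = features @ w_app
    pred_motion = features @ w_motion
    loss_app = np.mean((pred_app - target_app) ** 2)
    loss_motion = np.mean((pred_motion - target_motion) ** 2)
    # Both terms backpropagate into the same features in a real model,
    # which is what pushes the representation to encode motion as well.
    return loss_app + loss_motion


def guided_prediction(pred_full, pred_no_text, pred_no_motion,
                      w_text=1.0, w_motion=1.0):
    """Assumed CFG-style extrapolation with an additional motion term.

    pred_full      : prediction conditioned on text and the motion signal
    pred_no_text   : prediction with the text condition dropped
    pred_no_motion : prediction with the motion signal dropped
    The weights are illustrative, not values taken from the paper.
    """
    return (pred_full
            + w_text * (pred_full - pred_no_text)
            + w_motion * (pred_full - pred_no_motion))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))
    w_a = rng.normal(size=(8, 3))
    w_m = rng.normal(size=(8, 3))
    # Targets that the heads match exactly give zero joint loss.
    print(joint_loss(feats, w_a, w_m, feats @ w_a, feats @ w_m))
    p = rng.normal(size=(3,))
    # With both weights at zero, guidance leaves the prediction unchanged.
    print(np.allclose(guided_prediction(p, p + 1, p - 1, 0.0, 0.0), p))
```

In this sketch the motion term is recomputed at every denoising step from the model's current output, which is one way to read "evolving motion prediction as a dynamic guidance signal"; a fixed external motion input would not change between steps.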

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation