Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
Roussel Desmond Nzoyem, Mauro Comi

TL;DR
NOVA is a world modeling framework built on implicit neural representations. It models video efficiently and interpretably, disentangles scene structure without a heavy decoder, and supports controllable forecasting and editing.
Contribution
The paper presents NOVA, a structured INR-based world model that disentangles scene components and renders its representation analytically rather than through a learned decoder, which reduces computational cost and improves interpretability.
Findings
NOVA achieves strong controllable forecasting on challenging datasets.
The model can disentangle background, foreground, and motion without auxiliary losses.
NOVA has approximately 40 million parameters and runs efficiently on a single consumer GPU.
Abstract
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or…
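To make the idea of a weight-space state concrete, here is a minimal NumPy sketch in which a frame is represented by the flat weight-and-bias vector of a small coordinate MLP, and rendering means evaluating that MLP on a pixel grid. It is an illustration under stated assumptions, not NOVA's architecture: the layer sizes, sine activations, sigmoid output, and the names `init_inr_state`, `unflatten`, and `render` are all hypothetical, and the weights are random rather than trained. Reading "analytically rendered" as plain evaluation of the INR at query coordinates is also an assumption on our part.

```python
import numpy as np


def init_inr_state(rng, in_dim=2, hidden=32, out_dim=3):
    """Return a flat parameter vector (the 'state') and the layer shapes.

    Hypothetical 3-layer MLP; sizes are illustrative, not from the paper.
    """
    shapes = [
        (in_dim, hidden), (hidden,),
        (hidden, hidden), (hidden,),
        (hidden, out_dim), (out_dim,),
    ]
    size = sum(int(np.prod(s)) for s in shapes)
    theta = rng.normal(scale=0.5, size=size)
    return theta, shapes


def unflatten(theta, shapes):
    """Split the flat state vector back into per-layer arrays."""
    params, start = [], 0
    for shape in shapes:
        n = int(np.prod(shape))
        params.append(theta[start:start + n].reshape(shape))
        start += n
    return params


def render(theta, shapes, height, width):
    """Evaluate the INR on a height x width grid over [-1, 1]^2.

    No learned decoder is involved: the image is the MLP's output at
    each (y, x) coordinate, so any resolution can be queried.
    """
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1).reshape(-1, 2)
    w1, b1, w2, b2, w3, b3 = unflatten(theta, shapes)
    h = np.sin(grid @ w1 + b1)
    h = np.sin(h @ w2 + b2)
    rgb = 1.0 / (1.0 + np.exp(-(h @ w3 + b3)))  # squash to (0, 1)
    return rgb.reshape(height, width, 3)


rng = np.random.default_rng(0)
theta, shapes = init_inr_state(rng)

# The same state vector rendered at two resolutions.
low = render(theta, shapes, 16, 16)
high = render(theta, shapes, 64, 64)
```

Because the state is a function of continuous coordinates, the same `theta` can be queried on a denser grid than any it was fitted on, which is one way to picture the zero-shot super-resolution mentioned in the abstract. In a world model of this kind, the dynamics would act on `theta` itself; that part is not sketched here.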
