Generative Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes
Antoni Nowinowski, Krzysztof Krawiec

TL;DR
This paper extends a scene interpretation autoencoder to handle multiple objects, improving decomposition and reconstruction quality through novel training modes and a new benchmark, advancing compositional scene understanding.
Contribution
The study introduces an extended Disentangler of Visual Priors (DVP) model capable of multi-object scene interpretation, with new training strategies and a more complex benchmark for evaluation.
Findings
Outperforms baselines in reconstruction quality
Better decomposition of overlapping objects
Enhanced training stability and efficiency
Abstract
This study builds on the architecture of the Disentangler of Visual Priors (DVP), a type of autoencoder that learns to interpret scenes by decomposing the perceived objects into independent visual aspects of shape, size, orientation, and color appearance. These aspects are expressed as latent parameters that control a differentiable renderer, which performs image reconstruction, so that the model can be trained end-to-end with gradient descent using a reconstruction loss. In this study, we extend the original DVP so that it can handle multiple objects in a scene. We also exploit the interpretability of its latent representation by using the decoder to sample additional training examples and devising alternative training modes that rely on loss functions defined not only in the image space, but also in the latent space. This significantly facilitates training, which is otherwise challenging due to the presence of…
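The two kinds of loss mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the soft-disk renderer, the six-value latent layout (center, radius, RGB color), and the names `render_object`, `compose`, `image_loss`, and `latent_loss` are all assumptions made for illustration, and a noisy copy of the sampled latents stands in for the output of a trained encoder.

```python
import numpy as np

def render_object(latent, size=32, sharpness=50.0):
    """Toy renderer (assumed, not DVP's): a soft-edged disk.
    latent = (cx, cy, radius, r, g, b), all expected in [0, 1]."""
    cx, cy, radius, r, g, b = latent
    ys, xs = np.mgrid[0:size, 0:size] / (size - 1)
    dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    # A sigmoid edge keeps the mask smooth with respect to the latent values.
    alpha = 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))
    return alpha, np.array([r, g, b])

def compose(latents, size=32):
    """Render several objects onto a black canvas, later objects on top."""
    image = np.zeros((size, size, 3))
    for latent in latents:
        alpha, color = render_object(latent, size)
        image = image * (1.0 - alpha[..., None]) + alpha[..., None] * color
    return image

def image_loss(target, reconstruction):
    """Mean squared error in image space."""
    return float(np.mean((target - reconstruction) ** 2))

def latent_loss(true_latents, predicted_latents):
    """Mean squared error in latent space; usable only when the true
    latents are known, e.g. for scenes sampled through the renderer."""
    return float(np.mean((true_latents - predicted_latents) ** 2))

rng = np.random.default_rng(0)
# Sample latents for a two-object scene and render a training image from them.
sampled = rng.uniform(0.2, 0.8, size=(2, 6))
scene = compose(sampled)
# Stand-in for an encoder's prediction: the sampled latents plus noise.
predicted = sampled + rng.normal(0.0, 0.05, size=sampled.shape)

loss_img = image_loss(scene, compose(predicted))
loss_lat = latent_loss(sampled, predicted)
print(loss_img, loss_lat)
```

Because the scene is generated from known latents, both losses are available for it; for real images only the image-space loss can be computed.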
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
