Transformers and Slot Encoding for Sample Efficient Physical World Modelling
Francesco Petri, Luigi Asprino, Aldo Gangemi

TL;DR
This paper introduces a novel neural architecture combining Transformers with slot-attention to improve sample efficiency and object-based scene understanding in world modelling from video data.
Contribution
The paper presents a new architecture that integrates Transformers with slot attention, aiming at object-based scene representations and better sample efficiency in world modelling tasks.
Findings
Improved sample efficiency over existing methods
Reduced performance variation across training examples
Enhanced object-based scene understanding
Abstract
World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to the problem of world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to work only at the image level, thus disregarding the fact that the environment is composed of objects interacting with each other. In this paper, we propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the…
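To make the slot-attention idea mentioned in the abstract more concrete, here is a minimal NumPy sketch of the general mechanism: a small set of slot vectors compete, via a softmax taken over slots, to explain each input feature, and each slot is then updated from the features it won. This is an illustration under stated assumptions, not the authors' architecture. The function name `slot_attention`, the fixed random projections standing in for learned weights, and the plain weighted-mean update (the original slot-attention formulation uses a learned recurrent update instead) are all simplifications introduced here.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention(inputs, num_slots=4, iters=3, seed=0, eps=1e-8):
    """Simplified slot attention (illustrative only).

    inputs: array of shape (n, d), one feature vector per image location.
    Returns (slots, attn) with shapes (num_slots, d) and (n, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Assumption: random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))  # randomly initialised slots
    k, v = inputs @ w_k, inputs @ w_v
    for _ in range(iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)        # (n, num_slots)
        # Softmax over the slot axis: slots compete for each input location.
        attn = softmax(logits, axis=1)
        # Normalise per slot so each slot takes a weighted mean of the values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        # Assumption: direct weighted-mean update instead of a learned one.
        slots = weights.T @ v                # (num_slots, d)
    return slots, attn


features = np.random.default_rng(1).normal(size=(16, 8))  # 16 locations, 8 dims
slots, attn = slot_attention(features)
print(slots.shape, attn.shape)
```

In a world model of the kind the abstract describes, slot vectors like these, rather than whole-image features, would presumably be what the Transformer operates on. The exact wiring is not given in this summary.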
Taxonomy
Topics: Functional Brain Connectivity Studies
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
