Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning
Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang

TL;DR
This paper introduces OC-STORM, an object-centric model-based reinforcement learning framework that uses minimal annotations to improve sample efficiency and dynamics prediction in complex visual environments.
Contribution
It presents a novel OC-MBRL approach that leverages pretrained segmentation for object representations, enhancing sample efficiency without extensive labeling.
Findings
Outperforms STORM baseline on Atari 100k
Achieves state-of-the-art in Hollow Knight boss fights
Demonstrates effective object tracking with minimal annotations
Abstract
While deep reinforcement learning (RL) from pixels has achieved remarkable success, its sample inefficiency remains a critical limitation for real-world applications. Model-based RL (MBRL) addresses this by learning a world model to generate simulated experience, but standard approaches that rely on pixel-level reconstruction losses often fail to capture small, task-critical objects in complex, dynamic scenes. We posit that an object-centric (OC) representation can direct model capacity toward semantically meaningful entities, improving dynamics prediction and sample efficiency. In this work, we introduce OC-STORM, an object-centric MBRL framework that enhances a learned world model with object representations extracted by a pretrained segmentation network. By conditioning on a minimal number of annotated frames, OC-STORM learns to track decision-relevant object dynamics and…
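As a rough illustration of what "object representations" alongside a global visual representation can look like, the sketch below turns K segmentation masks into K per-object vectors by masked average pooling over a feature map, then appends one whole-frame vector. This is an assumption-laden stand-in, not the OC-STORM pipeline: the paper takes its object features from a pretrained segmentation network, and the function names `object_tokens` and `build_tokens`, the pooling scheme, and the array shapes here are all hypothetical.

```python
import numpy as np

def object_tokens(features, masks, eps=1e-8):
    """Masked average pooling (illustrative stand-in for pretrained object features).

    features: (H, W, C) feature map for one frame.
    masks:    (K, H, W) binary masks, one per tracked object.
    returns:  (K, C) array with one vector per object.
    """
    masks = masks.astype(features.dtype)
    area = masks.sum(axis=(1, 2))                        # pixels per object, shape (K,)
    pooled = np.einsum("khw,hwc->kc", masks, features)   # summed features per object
    # An empty mask has a zero sum, so dividing by eps still yields a zero vector.
    return pooled / np.maximum(area, eps)[:, None]

def build_tokens(features, masks):
    """Stack K object vectors with one global 'visual' vector -> (K + 1, C)."""
    obj = object_tokens(features, masks)
    vis = features.mean(axis=(0, 1))[None, :]            # whole-frame average, shape (1, C)
    return np.concatenate([obj, vis], axis=0)

# Toy frame: channel 0 is active in the top-left block, channel 1 in the bottom-right.
features = np.zeros((4, 4, 2))
features[:2, :2, 0] = 2.0
features[2:, 2:, 1] = 4.0
masks = np.zeros((2, 4, 4))
masks[0, :2, :2] = 1
masks[1, 2:, 2:] = 1
tokens = build_tokens(features, masks)
```

In this toy case each object vector isolates the channel active inside its mask, while the global vector dilutes both over the full frame, which loosely mirrors the abstract's point that small objects contribute little to a whole-frame signal.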
Peer Reviews
Decision·ICLR 2026 Poster
- The method integrates few-shot object features from SAM2/Cutie into a spatial–temporal world model (Transformer/RNN backbones), with clean modality separation (object tokens + visual token) and categorical VAE discretization; the training and architecture are well specified.
- On Atari-100k, object-centric variants outperform baselines (e.g., Cutie-OC-STORM reaches HNS mean 134.8% vs. STORM 114.2%, median 43.8% vs. 42.5%); Hollow Knight learning curves show faster convergence on harder bosse
- Comparative scope. Core comparisons are mainly within-framework ablations (STORM/DreamerV3 variants); external SOTA world-model baselines (e.g., diffusion/tokenization variants) and broader agent baselines are deferred or absent in the main text, and Hollow Knight lacks standardized settings, making cross-paper claims harder to calibrate.
- Annotation/K configuration burden. The user-set K (objects) and handful of annotated frames (≈6–12) are reasonable but the human-time budget and sensitivi
The idea is novel in using few-shot labelled object masks to obtain (pre-trained) object representations, which are then used to train a world model for policy control. The paper is clearly presented and of high quality. It presents positive results against reasonable baselines across a number of environments (Atari 100k, Hollow Knight, Meta-World). It reads as a comprehensive work that can be informative for future work in object-centric RL to build on. The ablations are comprehensive. The authors
The major weakness of this method is the effort vs. gains trade-off for using this object-centric representation. To use OC-STORM, the user must first generate 6-12 frames of object mask labels for each environment they may wish to run. The gain from doing this, based on the paper, seems to be _mainly_ about better _sample efficiency_. One could argue that instead of going through the effort of labelling, the user can also (i) run the alternative methods longer to get similar performance, or (ii
I appreciate the idea of adding object-based inductive biases into world modelling; this helps maintain object consistency and tracking for all objects, especially smaller ones. I like that the authors opted for a wide evaluation suite going beyond Atari to Hollow Knight and continuous control.
I think the paper would benefit from comparing against unsupervised object-centric representation learning baselines such as Slot Attention (https://arxiv.org/abs/2006.15055) and SlotFormer (https://arxiv.org/abs/2210.05861). The main claim of the paper is that object-based vectors help world modelling, but it is not clear whether there is something special in the vectors provided by SAM2/Cutie or whether unsupervised methods can also help. Also world modelling has been a focus in the unsupervised object
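The HNS mean and median figures quoted in the reviews above are human-normalized scores, conventionally computed per game as (agent - random) / (human - random) and then aggregated across games. The sketch below shows that arithmetic. The per-game numbers are made up purely for illustration and are not taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def human_normalized_score(agent, random, human):
    """Per-game HNS: 0.0 matches a random policy, 1.0 matches the human reference."""
    agent, random, human = (np.asarray(x, dtype=float) for x in (agent, random, human))
    return (agent - random) / (human - random)

def summarize(agent, random, human):
    """Return (mean, median) of HNS across games."""
    hns = human_normalized_score(agent, random, human)
    return float(np.mean(hns)), float(np.median(hns))

# Made-up raw scores for three hypothetical games (NOT values from the paper).
agent_scores = [5.0, 12.0, 500.0]
random_scores = [0.0, 10.0, 100.0]
human_scores = [20.0, 30.0, 200.0]

hns = human_normalized_score(agent_scores, random_scores, human_scores)
mean_hns, median_hns = summarize(agent_scores, random_scores, human_scores)
```

With these illustrative inputs, one strongly superhuman game pulls the mean far above the median, which is one reason both aggregates are usually reported together.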
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Focus
