SegDAC: Visual Generalization in Reinforcement Learning via Dynamic Object Tokens
Alexandre Brown, Glen Berseth

TL;DR
SegDAC introduces a segmentation-driven approach with dynamic object tokens for visual reinforcement learning, significantly enhancing generalization to visual changes while maintaining sample efficiency.
Contribution
The paper proposes a method that uses segmentation to build a variable-length set of object tokens, improving visual generalization in RL without auxiliary losses.
Findings
SegDAC outperforms prior methods by up to 88% on challenging tasks.
It maintains sample efficiency comparable to state-of-the-art methods.
Both segment positional encoding and variable-length tokens are crucial for performance.
Abstract
Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show…
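The token-extraction step described in the abstract can be illustrated with a small standard-library Python sketch. It rests on assumptions the abstract does not confirm. Masked mean pooling over a feature map is one simple way to turn a mask into an embedding. The normalized bounding box appended to each token is only a stand-in for the paper's segment positional encoding. The helper names (masked_mean_pool, normalized_bbox, object_tokens, pad_tokens) are hypothetical. The point is that the number of tokens follows the number of masks, so a batch needs padding plus a keep-mask before a transformer can attend over it.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one feature channel, H x W
Mask = List[List[int]]    # binary segment mask, H x W (1 = inside the segment)


def masked_mean_pool(features: List[Grid], mask: Mask) -> List[float]:
    """Average every feature channel over the pixels covered by the mask."""
    area = sum(sum(row) for row in mask)
    if area == 0:
        return [0.0] * len(features)
    pooled = []
    for channel in features:
        total = sum(
            value * inside
            for feat_row, mask_row in zip(channel, mask)
            for value, inside in zip(feat_row, mask_row)
        )
        pooled.append(total / area)
    return pooled


def normalized_bbox(mask: Mask) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the mask, scaled to [0, 1]."""
    height, width = len(mask), len(mask[0])
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, inside in enumerate(row) if inside]
    if not ys:
        return (0.0, 0.0, 0.0, 0.0)
    return (min(xs) / width, min(ys) / height,
            (max(xs) + 1) / width, (max(ys) + 1) / height)


def object_tokens(features: List[Grid], masks: List[Mask]) -> List[List[float]]:
    """One token per mask: pooled features followed by the box coordinates.

    Assumption: appending the box is an illustrative substitute for the
    paper's segment positional encoding, not its actual formulation.
    """
    return [masked_mean_pool(features, m) + list(normalized_bbox(m)) for m in masks]


def pad_tokens(token_sets: List[List[List[float]]], dim: int):
    """Pad variable-length token sets to a common length.

    Returns the padded sets and a keep-mask (True = real token) that an
    attention layer could use to ignore the padding.
    """
    max_len = max((len(tokens) for tokens in token_sets), default=0)
    padded, keep = [], []
    for tokens in token_sets:
        n_pad = max_len - len(tokens)
        padded.append(tokens + [[0.0] * dim for _ in range(n_pad)])
        keep.append([True] * len(tokens) + [False] * n_pad)
    return padded, keep


# Toy example: a 2-channel, 2x4 feature map with two segments.
features = [
    [[1.0, 1.0, 3.0, 3.0], [1.0, 1.0, 3.0, 3.0]],
    [[0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 2.0, 2.0]],
]
mask_left = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_right = [[0, 0, 1, 1], [0, 0, 1, 1]]

tokens = object_tokens(features, [mask_left, mask_right])
# Two timesteps with different object counts (2 and 1) batched together.
padded, keep = pad_tokens([tokens, tokens[:1]], dim=6)
```

In this toy setup the second timestep has one object fewer, so its token set receives one zero-padded entry that the keep-mask marks as ignorable. A fixed-slot representation would instead reserve the same number of slots regardless of how many objects appear in the scene.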
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
