TL;DR
This paper introduces Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models in visual reinforcement learning. ITC explicitly models which tokens persist from one frame to the next, and the paper reports state-of-the-art results on 4 benchmarks.
Contribution
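ITC formulates next-frame token prediction as a structured assignment problem in which each next-frame token is either copied from the previous frame or newly generated. This improves temporal consistency without changing the transformer architecture or training procedure.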
Findings
Achieves state-of-the-art performance on 4 benchmarks.
Significantly improves scores on the Craftax-classic benchmark.
Enhances token persistence and consistency in long-horizon rollouts.
Abstract
Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show…
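The copy-or-generate formulation in the abstract can be illustrated with a small sketch. It is not the paper's inference procedure, which this summary does not describe. It assumes two hypothetical score tables (`copy_score[i][j]` for explaining next-frame token `i` by copying previous-frame token `j`, and `gen_score[i]` for generating token `i` afresh) and adds the assumption that each previous-frame token is copied at most once. The function name `decode_correspondence` is likewise made up for illustration. The search is an exhaustive bitmask dynamic program, so it is only practical for a handful of tokens.

```python
from functools import lru_cache


def decode_correspondence(copy_score, gen_score):
    """Jointly pick, for every next-frame token, 'copy from j' or 'generate'.

    copy_score[i][j]: hypothetical score for copying previous token j into slot i.
    gen_score[i]:     hypothetical score for generating a new token at slot i.
    Assumption (not from the paper): a previous token is copied at most once.
    Returns (total_score, assignment), where assignment[i] is j or None.
    """
    n_next = len(gen_score)
    n_prev = len(copy_score[0]) if copy_score else 0

    @lru_cache(maxsize=None)
    def best(i, used_mask):
        # Best achievable score for slots i..n_next-1, given which previous
        # tokens (bits of used_mask) have already been copied.
        if i == n_next:
            return 0.0, ()
        # Option 1: explain slot i by generating a new token.
        tail_score, tail = best(i + 1, used_mask)
        best_score = gen_score[i] + tail_score
        best_assign = (None,) + tail
        # Option 2: explain slot i by copying an unused previous token j.
        for j in range(n_prev):
            if used_mask & (1 << j):
                continue
            tail_score, tail = best(i + 1, used_mask | (1 << j))
            score = copy_score[i][j] + tail_score
            if score > best_score:
                best_score, best_assign = score, (j,) + tail
        return best_score, best_assign

    total, assignment = best(0, 0)
    return total, list(assignment)


# Both next-frame slots prefer previous token 0 when scored independently,
# which would duplicate it. The joint assignment copies it once and
# generates the other slot.
copy_score = [[2.0, 0.1],
              [1.9, 0.2]]
gen_score = [0.5, 1.0]
total, assignment = decode_correspondence(copy_score, gen_score)
print(assignment, total)
```

In this toy case, scoring each slot independently would copy previous token 0 twice, which is the kind of duplication the abstract lists among the temporal inconsistencies. The joint assignment avoids it by construction. For realistic token counts, a polynomial-time assignment solver would replace the exhaustive search.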
