Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
Manuel Goulão, Arlindo L. Oliveira

TL;DR
This paper investigates pretraining Vision Transformers with self-supervised methods for reinforcement learning, emphasizing the importance of temporal relations and demonstrating improved data efficiency and richer representations in Atari environments.
Contribution
It introduces a temporal order verification task to enhance self-supervised pretraining of Vision Transformers for RL, leading to better representations and performance.
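The summary does not say how training examples for the temporal order verification task are built. As a rough illustration only, the sketch below follows the generic "shuffle and learn" recipe: take a short window of consecutive frames and either keep it in order (label 1) or permute it (label 0), so that a binary classifier on top of the encoder can be trained to tell the two apart. The function name `make_order_example`, the window size of 3, and the 50/50 sampling are assumptions made for this example, not details taken from the paper.

```python
import random


def make_order_example(frames, rng, window=3):
    """Build one (clip, label) pair for a temporal order verification task.

    label 1: the clip keeps its original temporal order.
    label 0: the clip has been permuted into a different order.
    Hypothetical helper for illustration; not the authors' code.
    """
    if window < 2 or len(frames) < window:
        raise ValueError("need at least `window` >= 2 frames")

    # Pick a random window of consecutive time steps.
    start = rng.randrange(len(frames) - window + 1)
    order = list(range(start, start + window))
    label = 1

    # With probability 0.5, shuffle the time indices into a wrong order.
    if rng.random() < 0.5:
        original = order[:]
        while order == original:  # make sure the order really changed
            rng.shuffle(order)
        label = 0

    return [frames[i] for i in order], label


if __name__ == "__main__":
    rng = random.Random(0)
    episode = [f"frame_{t}" for t in range(8)]  # stand-in for observations
    clip, label = make_order_example(episode, rng)
    print(clip, label)
```

Shuffling time indices rather than the frames themselves keeps the sampler well defined even when consecutive observations happen to be identical.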
Findings
Self-supervised pretraining improves RL data efficiency.
Temporal order verification enhances representation quality.
Pretrained encoder yields richer, more focused attention maps.
Abstract
The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks in several benchmarks. Nevertheless, convolutional neural networks (CNNs) remain the preferred architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations. To show the importance of the temporal dimension in this context, we propose an extension of VICReg that better captures temporal relations between observations by adding a temporal order verification task. Our results show that all methods are effective in learning useful representations and avoiding representational collapse for observations from the Arcade Learning Environment (ALE), which leads to improvements in data…
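For readers unfamiliar with VICReg, the method the abstract extends, the following NumPy sketch shows the three terms of its loss: invariance (the two views should agree), variance (each embedding dimension should keep a standard deviation near 1, which discourages collapse), and covariance (different dimensions should be decorrelated). It is a minimal sketch of the general VICReg objective, not the authors' implementation and not the extended loss with the temporal order term. The function name `vicreg_loss` is hypothetical, and the weights 25/25/1 are commonly used VICReg defaults assumed here rather than values reported in this summary.

```python
import numpy as np


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss for two batches of embeddings of shape (n, d).

    Illustrative sketch; the weights are assumed defaults, not values
    taken from the paper summarized above.
    """
    n, d = z_a.shape

    # Invariance: embeddings of the two views should match.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation (target: >= 1).
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalize squared off-diagonal entries of the covariance matrix.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return sim_w * invariance + var_w * variance + cov_w * covariance


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = rng.normal(size=(256, 8))   # well-spread embeddings
    collapsed = np.zeros((256, 8))       # fully collapsed embeddings
    print("spread:   ", vicreg_loss(spread, spread))
    print("collapsed:", vicreg_loss(collapsed, collapsed))
```

In the usage example, the collapsed batch is penalized heavily by the variance term even though both views agree perfectly, which is the mechanism that guards against representational collapse.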
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Label Smoothing
