On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
Shiro Takagi

TL;DR
This paper empirically examines how pre-training on different modalities like language and vision impacts the fine-tuning of Transformer models in offline reinforcement learning, revealing modality-specific effects on representations and performance.
Contribution
It provides new insights into how modality-specific pre-training influences Transformer representations and their effectiveness in offline reinforcement learning tasks.
Findings
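Pre-training substantially changes a Transformer's internal representations, yet during fine-tuning the pre-trained models acquire less information about the data than a randomly initialized model.
The parameters of pre-trained models change little during fine-tuning, and the poor performance of the image-pre-trained model may partly stem from large gradients and gradient clipping.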
Language pre-training enables the model to learn efficiently even without context, suggesting it captures context-like information.
Abstract
We empirically investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models to MuJoCo offline reinforcement learning tasks. Analysis of the internal representation reveals that the pre-trained Transformers acquire largely different representations before and after pre-training, but acquire less information of data in fine-tuning than the randomly initialized one. A closer look at the parameter changes of the pre-trained Transformers reveals that their parameters do not change that much and that the bad performance of the model pre-trained with image data could partially come from large gradients and gradient clipping. To study what information the Transformer pre-trained with language data utilizes, we fine-tune this model with no context provided, finding that the model learns efficiently even without context…
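The abstract links the image-pre-trained model's weak performance to large gradients and gradient clipping. As a rough illustration of why that combination can matter, the sketch below shows global-norm clipping rescaling an oversized gradient, together with a simple measure of how far parameters move during fine-tuning. It is not the paper's code: the function names `clip_by_global_norm` and `relative_change`, the flat-list representation of gradients and parameters, and the threshold value are all assumptions made for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their overall L2 norm is at most max_norm.

    Hypothetical helper: `grads` is a flat list of floats standing in for
    all parameter gradients of a model. Returns (clipped, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Small gradients pass through unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    # Large gradients keep their direction but lose most of their magnitude.
    return [g * scale for g in grads], total_norm


def relative_change(before, after):
    """L2 distance between two parameter vectors, relative to the norm of `before`."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    base = math.sqrt(sum(b * b for b in before))
    return diff / base


if __name__ == "__main__":
    # Assumed threshold of 1.0; the gradient [3, 4] has norm 5, so it is scaled by 1/5.
    clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
    print("original norm:", norm)
    print("clipped gradient:", clipped)
    # Relative parameter movement between a pre- and post-fine-tuning snapshot.
    print("relative change:", relative_change([1.0, 0.0], [1.0, 0.5]))
```

When the gradient norm is far above the threshold at every step, each update is capped at the same small size regardless of how large the raw gradient is, which is one plausible way clipping could slow fine-tuning.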
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Adam · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer
