Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games
Lukas Schäfer, Logan Jones, Anssi Kanervisto, Yuhan Cao, Tabish Rashid, Raluca Georgescu, Dave Bignell, Siddhartha Sen, Andrea Treviño Gavito, Sam Devlin

TL;DR
This paper systematically evaluates the effectiveness of pre-trained visual encoders versus end-to-end training for imitation learning in modern video games, demonstrating that pre-trained encoders can improve decision-making and reduce training costs.
Contribution
It provides a comparative analysis of pre-trained versus task-specific visual encoders for imitation learning in modern video games, highlighting the benefits of using pre-trained models.
Findings
Pre-trained encoders like DINOv2 improve decision-making performance.
End-to-end training is effective with low-resolution images and minimal demonstrations.
Pre-trained encoders significantly reduce training costs and complexity.
Abstract
Video games have served as useful benchmarks for the decision-making community, but going beyond Atari games towards modern games has been prohibitively expensive for the vast majority of the research community. Prior work in modern video games typically relied on game-specific integration to obtain game features and enable online training, or on existing large datasets. An alternative approach is to train agents using imitation learning to play video games purely from images. However, this setting poses a fundamental question: which visual encoders obtain representations that retain information critical for decision making? To answer this question, we conduct a systematic study of imitation learning with publicly available pre-trained visual encoders compared to the typical task-specific end-to-end training approach in Minecraft, Counter-Strike: Global Offensive, and Minecraft…
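The setting described in the abstract, imitation learning on top of a fixed visual encoder, can be illustrated with a minimal behaviour cloning sketch. Everything here is an assumption made for illustration rather than the authors' setup: a fixed random projection stands in for a pre-trained encoder such as DINOv2, the "demonstrations" are synthetic, and the policy is a single linear layer trained with cross-entropy. The names `encode`, `make_demos`, and `train_policy_head` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, FEAT_DIM, N_ACTIONS = 64, 16, 3

# ASSUMPTION: a fixed random projection stands in for a frozen pre-trained
# encoder; a real study would run a model such as DINOv2 here instead.
W_FROZEN = rng.normal(size=(IMG_DIM, FEAT_DIM)) / np.sqrt(IMG_DIM)
W_FROZEN_COPY = W_FROZEN.copy()  # kept only to check the encoder never changes


def encode(images):
    """Map flattened images to features; the encoder weights are never updated."""
    return np.maximum(images @ W_FROZEN, 0.0)


# ASSUMPTION: a synthetic "expert" whose action is a linear function of the features.
W_EXPERT = rng.normal(size=(FEAT_DIM, N_ACTIONS))


def make_demos(n):
    """Generate n synthetic (image, expert action) demonstration pairs."""
    images = rng.normal(size=(n, IMG_DIM))
    actions = np.argmax(encode(images) @ W_EXPERT, axis=1)
    return images, actions


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def bc_loss(head, feats, actions):
    """Behaviour cloning loss: mean cross-entropy against the expert actions."""
    probs = softmax(feats @ head)
    return -np.mean(np.log(probs[np.arange(len(actions)), actions] + 1e-12))


def train_policy_head(feats, actions, steps=300, lr=0.5):
    """Fit only the linear policy head by gradient descent on the BC loss."""
    head = np.zeros((FEAT_DIM, N_ACTIONS))
    one_hot = np.eye(N_ACTIONS)[actions]
    for _ in range(steps):
        grad = feats.T @ (softmax(feats @ head) - one_hot) / len(actions)
        head -= lr * grad
    return head


train_images, train_actions = make_demos(512)
test_images, test_actions = make_demos(256)

# With a frozen encoder, features can be computed once and reused every step.
train_feats = encode(train_images)

initial_loss = bc_loss(np.zeros((FEAT_DIM, N_ACTIONS)), train_feats, train_actions)
head = train_policy_head(train_feats, train_actions)
final_loss = bc_loss(head, train_feats, train_actions)
held_out_acc = np.mean(np.argmax(encode(test_images) @ head, axis=1) == test_actions)

print(f"BC loss {initial_loss:.3f} -> {final_loss:.3f}, held-out accuracy {held_out_acc:.2f}")
```

In an end-to-end variant, the encoder weights would also receive gradients at every step, so features could not be cached. That difference is one plausible reason a frozen encoder lowers training cost, though this sketch does not measure it.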
Peer Reviews
Decision · Submitted to ICLR 2024
1. The authors have selected a diverse set of modern video games, including Minecraft, Minecraft Dungeons, and Counter-Strike: Global Offensive, for their experimental studies. This choice reflects a significant step forward from the commonly used Atari games in previous research, providing a more realistic and challenging benchmark for evaluating imitation learning techniques.
2. The paper introduces an innovative approach to imitation learning by leveraging publicly available large vision mode
1. It seems that the task is very simple, such as chopping trees in Minecraft, which is a fairly straightforward task. The authors' conclusion is that there is no significant difference between various visual encoders and input image resolutions. However, due to the simplicity of the task, this conclusion is unreliable. Evaluating models like CLIP and DINO on such simple tasks does not effectively demonstrate the differences between modern vision transformers and CNNs. I strongly recommend that
* The paper is well-written and easy to follow.
* This paper studies an important problem: the difference of vision encoders in building policy models for decision-making.
* The selected environments are three modern video games, which are popular and challenging. To some degree, I believe the conclusions drawn from these environments can be generalized to real-world scenarios.
* **Missing some details.** It is not clear what kinds of image augmentation tricks are used. Why is the image augmentation method specific to the game? Why is a pre-trained model (DINOv2) better than the others? The paper lacks in-depth discussion.
* **Provide rollout videos for better understanding.** Rollout videos are very helpful for readers to understand the challenges of the environments and the effectiveness of the model. It is strongly recommended to include some videos in the supplementary mat
The authors provide a valuable datapoint to the community for which existing pretrained encoders they may want to initialize their experiments from (seemingly DINO).
Small scope and unsurprising results. This paper is more of a baselines paper comparing existing methods. For a baselines paper, I would expect far more extensive experiments across domains and methods. The domains considered here, while they are “modern video games”, are quite limited. E.g. for Minecraft they only consider the treechop task, which is the most basic thing one can do in Minecraft
Taxonomy
Topics: Artificial Intelligence in Games · Human Pose and Action Recognition · Reinforcement Learning in Robotics
