Utilizing Evolution Strategies to Train Transformers in Reinforcement Learning
Matyáš Lorenc, Roman Neruda

TL;DR
This paper demonstrates that evolution strategies can effectively train large transformer-based agents in reinforcement learning, achieving high performance in complex environments such as MuJoCo Humanoid and Atari games.
Contribution
It applies evolution strategies to training transformer-based policies in reinforcement learning and shows that they remain effective for models considerably larger and more complex than those previously trained this way.
Findings
Evolution strategies can successfully train transformer-based agents.
High-performing agents were achieved in MuJoCo and Atari environments.
Evolution strategies are a viable black-box optimization method for complex models.
Abstract
We explore the capability of evolution strategies to train an agent with a policy based on a transformer architecture in a reinforcement learning setting. We performed experiments using OpenAI's highly parallelizable evolution strategy to train Decision Transformer in the MuJoCo Humanoid locomotion environment and in Atari game environments, testing the ability of this black-box optimization technique to train even such relatively large and complicated models (compared to those previously tested in the literature). The examined evolution strategy proved generally capable of achieving strong results and managed to produce high-performing agents, showcasing evolution's ability to tackle the training of even such complex models.
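For readers unfamiliar with OpenAI-ES, a minimal NumPy sketch of one update step follows. It assumes mirrored (antithetic) noise, centered-rank fitness shaping, and plain gradient ascent; the names `es_step` and `centered_ranks` and all hyperparameter values are hypothetical illustrations, not the authors' implementation (the original OpenAI-ES setup passed this estimate to Adam rather than stepping directly).

```python
import numpy as np


def centered_ranks(x: np.ndarray) -> np.ndarray:
    """Replace raw returns by their ranks, rescaled to [-0.5, 0.5] (fitness shaping)."""
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5


def es_step(f, theta, sigma=0.1, n_pairs=50, lr=0.05, rng=None):
    """One antithetic ES update of theta, ascending E[f(theta + sigma * eps)].

    f maps a flat parameter vector to a scalar score (in RL, an episode return).
    Hypothetical helper: plain gradient ascent is used here for brevity.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))        # one noise vector per pair
    f_pos = np.array([f(theta + sigma * e) for e in eps])   # mirrored evaluations
    f_neg = np.array([f(theta - sigma * e) for e in eps])
    shaped = centered_ranks(np.concatenate([f_pos, f_neg]))  # rank all 2n scores jointly
    weights = shaped[:n_pairs] - shaped[n_pairs:]            # antithetic difference
    grad = weights @ eps / (2 * n_pairs * sigma)             # Monte Carlo gradient estimate
    return theta + lr * grad
```

With `f` set to the return of a policy rollout and `theta` to the flattened policy weights, repeated calls to `es_step` form the basic training loop; the `2 * n_pairs` evaluations are independent, which is what makes the method easy to distribute. Rank shaping makes the update insensitive to the scale of the returns, which helps when reward magnitudes differ widely between environments.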
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is very relevant. The motivation is clear. The authors want to understand if ES can handle modern, large RL architectures. This is timely and relevant. The experimental setup shows solid engineering effort. They re-implemented OpenAI-ES and carefully described the main design decisions. The paper presents an interesting observation. ES initially weakens a pretrained model before improving it. The authors argue that ES first improves robustness in parameter space. This is a useful in
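As background to the robustness argument above (a standard reading of ES, not the paper's own derivation): OpenAI-ES does not ascend the return $F(\theta)$ directly but its Gaussian-smoothed version $J_\sigma$, whose gradient can be estimated from return evaluations alone:

$$
J_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\big],
\qquad
\nabla_\theta J_\sigma(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\,\epsilon\big].
$$

Because $J_\sigma$ averages $F$ over a neighborhood of scale roughly $\sigma$, a pretrained $\theta$ on a sharp peak, where small parameter perturbations collapse the return, can score lower under $J_\sigma$ than a nearby flatter point with a somewhat lower $F(\theta)$. Following $\nabla_\theta J_\sigma$ can therefore first reduce $F(\theta)$ before increasing it.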
The contribution feels more like a feasibility study than a new method or strong theoretical insight. The paper could benefit from clearer framing about what new knowledge is gained beyond “ES works on transformers.” The baseline using TD3 with a Decision-Transformer-style model is not strong. It mostly shows that standard RL fails here. Stronger or more relevant baselines (e.g., modern online sequence models or RvS approaches) would help. The return-to-go signal essentially gets ignored when
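For context on the return-to-go signal mentioned in this review: Decision Transformer conditions each action on the sum of rewards still to be collected. A minimal sketch of how that conditioning sequence is built, assuming undiscounted returns as in the original Decision Transformer; `returns_to_go` and `conditioning_targets` are hypothetical helpers, not code from the paper under review.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go R_t = r_t + r_{t+1} + ... + r_T for each step t (training data)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def conditioning_targets(target_return, rewards):
    """At rollout time, start from a chosen target return and subtract each reward received."""
    targets = []
    remaining = target_return
    for r in rewards:
        targets.append(remaining)  # value fed to the model before acting at this step
        remaining -= r
    return targets
```

During training the targets come from logged trajectories via `returns_to_go`; at evaluation time the desired return is supplied by the user and decremented step by step, as in `conditioning_targets`.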
This paper covers an interesting question of using evolutionary strategies to train transformer architectures in an RL setup. Given the recent interest in transformer architectures, whether in decision making (Decision Transformer) or reasoning (LLMs and reasoning models), this sheds light on an interesting research direction. This paper also includes some promising results and insights, providing a useful starting point for future research into evolutionary strategies for transformer architec
While this paper offers a promising vantage point for future research, I think several improvements could be made to offer more to the RL community. First, while the experiments only include a dotted-line comparison with TD3, I think it would be better to include a more thorough comparison with more RL algorithms (PPO, SAC, etc.) as well as their resource usage plots. While evolutionary strategies often consume far more computational resources than more traditional RL algorithms, ma
The key strength of this paper lies in its demonstration that evolution strategies, despite being simple and gradient-free, can effectively train large transformer-based reinforcement learning models like the Decision Transformer. This highlights the scalability, robustness, and strong parallelization potential of evolution strategies, extending their applicability to more complex neural architectures previously dominated by gradient-based methods.
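Much of the parallelization potential mentioned here comes from a trick used in OpenAI-ES: workers share random seeds, so each noise vector can be regenerated locally and only scalar returns need to be communicated. A minimal single-process sketch, assuming one NumPy `default_rng` seed per perturbation and raw return differences in place of rank shaping for brevity; `perturbation`, `worker_eval`, `apply_update`, and the seed scheme are hypothetical.

```python
import numpy as np


def perturbation(seed: int, dim: int) -> np.ndarray:
    # Any process holding the seed regenerates exactly the same noise vector.
    return np.random.default_rng(seed).standard_normal(dim)


def worker_eval(f, theta, sigma, seed):
    """Worker side: evaluate a mirrored pair and send back only two scalars."""
    eps = perturbation(seed, theta.size)
    return f(theta + sigma * eps), f(theta - sigma * eps)


def apply_update(theta, sigma, lr, seeds, results):
    """Update side: rebuild each eps from its seed; no parameter or noise vectors are exchanged."""
    grad = np.zeros_like(theta)
    for seed, (f_pos, f_neg) in zip(seeds, results):
        grad += (f_pos - f_neg) * perturbation(seed, theta.size)
    return theta + lr * grad / (2 * len(seeds) * sigma)
```

Since every worker can run `apply_update` itself once the scalar results are broadcast, all copies of `theta` stay synchronized while the per-generation communication is only two floats per seed, independent of model size.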
While the paper is clearly written and experimentally careful, its novelty is quite limited. The main claim that evolution strategies can train transformer-based reinforcement learning agents is not fundamentally new. Evolution strategies have already been shown to scale effectively to large, high-dimensional models, including transformer architectures, across various domains such as natural language processing, neural architecture search, and dynamic scheduling. The present work simply extends
Taxonomy
Topics: Evolutionary Algorithms and Applications
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
