2048: Reinforcement Learning in a Delayed Reward Environment
Prady Saligram, Tanvir Bhathal, Robby Manihani

TL;DR
This paper introduces a distributional multi-step reinforcement learning framework for the delayed-reward game 2048. Its H-DQN agent outperforms standard DQN, PPO, and QR-DQN baselines, reaching the 2048 tile and, when scaled, the 4096 tile.
Contribution
The work presents Horizon-DQN (H-DQN), a novel algorithm that combines distributional learning, dueling architectures, noisy networks, prioritized replay, and other techniques to improve RL in sparse, delayed-reward environments.
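To make the "dueling architecture" component concrete, here is a minimal sketch of the standard dueling aggregation step, which combines a scalar state value with per-action advantages. It is an illustration only, not the authors' implementation: the function name `dueling_q_values` and the example numbers are hypothetical, and the four actions are assumed to correspond to the four moves in 2048.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean(A(s, .)).

    Subtracting the mean advantage keeps the value/advantage split identifiable.
    Hypothetical helper for illustration; not taken from the paper.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Example with four actions (assumed here to be up, down, left, right).
q = dueling_q_values(1.0, [1.0, 2.0, 3.0, 6.0])
print(q)
```

In a full agent, the value and advantage inputs would come from two heads of a shared network; in the distributional setting the same aggregation is typically applied per quantile or per atom.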
Findings
H-DQN outperforms standard DQN, PPO, and QR-DQN in 2048.
H-DQN reaches the 2048 tile with a score of 18.21K.
A scaled-up H-DQN reaches a score of 41.828K and the 4096 tile.
Abstract
Delayed and sparse rewards present a fundamental obstacle for reinforcement-learning (RL) agents, which struggle to assign credit for actions whose benefits emerge many steps later. The sliding-tile game 2048 epitomizes this challenge: although frequent small score changes yield immediate feedback, they often mislead agents into locally optimal but globally suboptimal strategies. In this work, we introduce a unified, distributional multi-step RL framework designed to directly optimize long-horizon performance. Using the open-source Gym-2048 environment, we develop and compare four agent variants: standard DQN, PPO, QR-DQN (Quantile Regression DQN), and a novel Horizon-DQN (H-DQN) that integrates distributional learning, dueling architectures, noisy networks, prioritized replay, and more. Empirical evaluation reveals a clear hierarchy in effectiveness: max episode scores improve from…
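The multi-step idea in the abstract can be illustrated with a generic n-step bootstrapped return, which propagates a later payoff back to earlier actions faster than a one-step target. This is a sketch under stated assumptions, not the paper's code: the function name `n_step_return`, the discount factor, and the reward values are all hypothetical.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float = 0.99) -> float:
    """Return r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value.

    `bootstrap_value` stands in for the agent's value estimate of the state
    reached after n steps (assumed to be 0.0 if the episode ended).
    """
    g = bootstrap_value
    # Fold the rewards in from last to first so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three observed rewards followed by a bootstrapped estimate of 10.0.
target = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
print(target)
```

With a larger n, a reward earned several moves after a merge setup contributes directly to the target for the earlier move, at the cost of higher variance.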
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning
Methods: Q-Learning · Convolution · Dense Connections · Deep Q-Network · Proximal Policy Optimization
