A study of first-passage time minimization via Q-learning in heated gridworlds
M.A. Larchenko, P. Osinenko, G. Yaremenko, V.V. Palyulin

TL;DR
This paper investigates how reinforcement learning agents optimize first-passage times in heated gridworlds with uneven noise, revealing biases in common algorithms that impact exploration and performance.
Contribution
It provides a detailed analysis of bias effects in tabular Q-learning, SARSA, Expected SARSA, and Double Q-learning in environments with uneven noise levels.
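The four methods named above differ mainly in how they build the bootstrapped target for a transition (s, a, r, s'). The sketch below puts the textbook targets side by side. It is an illustration, not the authors' code: the function names, the epsilon-greedy behavior policy, and the parameter defaults are assumptions, and only non-terminal next states are handled.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state (assumed policy)."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_row))] += 1.0 - epsilon
    return probs

def td_target(method, Q, r, s_next, a_next=None, gamma=0.99, epsilon=0.1, Q2=None):
    """Textbook TD target for a non-terminal next state s_next.

    Q is an (n_states, n_actions) table; Q2 is the second table used by Double Q-learning.
    """
    if method == "q_learning":
        # Bootstrap from the greedy value of the next state.
        return r + gamma * np.max(Q[s_next])
    if method == "sarsa":
        # Bootstrap from the action actually taken next.
        return r + gamma * Q[s_next, a_next]
    if method == "expected_sarsa":
        # Bootstrap from the policy-weighted average of next-state values.
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        return r + gamma * float(np.dot(probs, Q[s_next]))
    if method == "double_q":
        # Select the action with one table, evaluate it with the other.
        a_star = int(np.argmax(Q[s_next]))
        return r + gamma * Q2[s_next, a_star]
    raise ValueError(f"unknown method: {method}")
```

In every case the table entry is then moved toward the target with the learning rate alpha, Q[s, a] += alpha * (target - Q[s, a]), which is the parameter whose size the findings below are concerned with.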
Findings
High learning rates hinder exploration in high-noise regions.
Low learning rates increase agent presence in high-noise areas.
These biases of temporal-difference (TD) methods should be taken into account in real-world physical applications and agent design.
Abstract
Optimization of first-passage times is required in applications ranging from nanobot navigation to market trading. In such settings, one often encounters unevenly distributed noise levels across the environment. We extensively study how a learning agent fares in 1- and 2-dimensional heated gridworlds with an uneven temperature distribution. The results show certain bias effects in agents trained via simple tabular Q-learning, SARSA, Expected SARSA and Double Q-learning. While a high learning rate prevents exploration of regions with higher temperature, a low enough rate increases the presence of agents in such regions. The discovered peculiarities and biases of temporal-difference-based reinforcement learning methods should be taken into account in real-world physical applications and agent design.
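To make the setting concrete, here is a minimal sketch of tabular Q-learning on a 1-dimensional heated gridworld that records the first-passage time of each episode. Everything specific in it is an assumption rather than the paper's setup: the noise model (with probability temperature[s] the chosen move is replaced by a uniformly random one), the start at the left end, the goal at the right end, the reward of -1 per step, and all names and defaults.

```python
import numpy as np

def step(s, a, temperature, n_states, rng):
    """One move on the line; action 0 is left, action 1 is right."""
    # Assumed noise model: a hotter cell is more likely to override the chosen move.
    if rng.random() < temperature[s]:
        a = int(rng.integers(2))
    s_next = s + (1 if a == 1 else -1)
    s_next = min(max(s_next, 0), n_states - 1)  # walls at both ends
    done = s_next == n_states - 1               # goal is the rightmost cell
    return s_next, -1.0, done

def train_q_learning(temperature, alpha=0.1, gamma=1.0, epsilon=0.1,
                     episodes=500, max_steps=1000, seed=0):
    """Return the learned Q-table and the first-passage time of every episode."""
    rng = np.random.default_rng(seed)
    n_states = len(temperature)
    Q = np.zeros((n_states, 2))
    passage_times = []
    for _ in range(episodes):
        s = 0
        for t in range(1, max_steps + 1):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = int(rng.integers(2))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = step(s, a, temperature, n_states, rng)
            # Q-learning target; no bootstrapping from the terminal state.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
        # If the goal was never reached, t equals max_steps (a censored time).
        passage_times.append(t)
    return Q, np.array(passage_times)

# Example: a line of 8 cells with a hot region in the middle.
temperature = np.array([0.0, 0.0, 0.6, 0.6, 0.6, 0.0, 0.0, 0.0])
Q, times = train_q_learning(temperature, alpha=0.1)
```

Rerunning with different values of alpha and comparing the recorded times is one simple way to probe the learning-rate dependence the abstract describes, although on a single line with one goal the agent has no way to route around the hot region, so the 2-dimensional case is where path choice becomes visible.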
Taxonomy
Topics: Optimization and Search Problems · Distributed Control Multi-Agent Systems · Modular Robots and Swarm Intelligence
Methods: SARSA · Double Q-learning · Expected SARSA
