TL;DR
This paper explores enabling reinforcement learning agents to reason about their computational costs, leading to more efficient agents that perform better and use less compute in complex environments.
Contribution
It introduces a method for agents to reason about and control their computation, improving both efficiency and performance under the same training compute budget.
Findings
With the same training compute budget, agents that reason about their compute perform better on 75% of Arcade Learning Environment games.
These agents use three times less compute on average.
Efficiency gains are analyzed across individual games.
Abstract
While reinforcement learning agents can achieve superhuman performance in many complex tasks, they typically do not become more computationally efficient as they improve. In contrast, humans gradually require less cognitive effort as they become more proficient at a task. If agents could reason about their compute as they learn, could they similarly reduce their computation footprint? If they could, we could have more energy efficient agents or free up compute cycles for other processes like planning. In this paper, we experiment with showing agents the cost of their computation and giving them the ability to control when they use compute. We conduct our experiments on the Arcade Learning Environment, and our results demonstrate that with the same training compute budget, agents that reason about their compute perform better on 75% of games. Furthermore, these agents use three times…
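The abstract does not spell out the mechanism, but the reviews on this page describe it as a modification to the MDP reward, with compute accrued only at decision steps and with options that run at different frequencies. A minimal sketch of that idea, under those assumptions, might look like the following. Every name here (`ToyEnv`, `ComputeAwareEnv`, `compute_cost`, `repeats`) is hypothetical and chosen for illustration; this is not the authors' implementation, and the toy environment merely stands in for an Atari game.

```python
from typing import Sequence, Tuple


class ToyEnv:
    """Hypothetical stand-in for an Arcade Learning Environment game."""

    def __init__(self, horizon: int = 12) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Reward 1.0 for action 1, else 0.0; the episode ends at the horizon.
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, self.t >= self.horizon


class ComputeAwareEnv:
    """Assumed wrapper: the agent picks an action AND how long to repeat it.

    Each decision is charged `compute_cost`, so deciding less often is
    cheaper. The cost value and repeat lengths are illustrative only.
    """

    def __init__(self, env: ToyEnv, compute_cost: float = 0.1,
                 repeats: Sequence[int] = (1, 2, 4)) -> None:
        self.env = env
        self.compute_cost = compute_cost
        self.repeats = tuple(repeats)
        self.decisions = 0

    def reset(self) -> int:
        self.decisions = 0
        return self.env.reset()

    def step(self, action: int, repeat_index: int) -> Tuple[int, float, bool]:
        self.decisions += 1
        total = -self.compute_cost  # cost is accrued once per decision step
        obs, done = self.env.t, False
        for _ in range(self.repeats[repeat_index]):
            obs, reward, done = self.env.step(action)
            total += reward
            if done:
                break
        return obs, total, done


def run_episode(env: ComputeAwareEnv, repeat_index: int) -> Tuple[float, int]:
    """Roll out a fixed policy (always action 1) at one decision frequency."""
    env.reset()
    episode_return, done = 0.0, False
    while not done:
        _, reward, done = env.step(1, repeat_index)
        episode_return += reward
    return episode_return, env.decisions
```

In this toy setting, a policy that commits to four-step repeats makes a quarter as many decisions as one that decides every step, and it collects a slightly higher penalized return because it pays the per-decision cost less often. A learning agent would choose the repeat length per state rather than fixing it, which is the part this sketch leaves out.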
Peer Reviews
Decision · Submitted to ICLR 2026
The paper’s methods are concise, and the results make sense. I believe the results can be easily replicated.
The paper positions itself as investigating algorithms that reason about their own computation. However, the method presented is a straightforward modification to the reward given by the MDP, which makes it closely related to reward shaping. Relevant literature from that field is not mentioned, so the paper does not position itself within it.
1. The setup is clearly stated, and the overall flow is easy to follow. I enjoyed reading this paper a lot, and I often found answers to my questions and confusions just a few sentences away.
2. The idea of modeling deliberate control of the compute budget with options at different frequencies is quite novel, and it turns out to be very effective.
3. The experimental evaluation is thorough and convincing.
If I have to say something here, the only thing I would say is that if the authors could show similar results on larger tasks such as VLA or LLM fine-tuning, it would make the work perfect.
- Clean formalization and direct application to any value-based RL method.
- Experimental results on all Atari environments, showing fewer decisions on average and higher HNS compared to existing baselines.
- The analysis of how the decision rate changes within an episode is interesting and novel.
- Compute is only accrued at decision steps, which does not cover simulator overhead or option execution. Since the proposed agent might see more frames/episodes than the baseline agent during decision-making, this may confound the reported gains.
- The framework seems to rely heavily on the existence of temporally extended options. Given the prevalent use of action repeat / frame skip in the Atari literature, this narrowly makes sense. In an arbitrary RL environment, however, it's not obvious to me t
