Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning
Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, and Masashi Sugiyama

TL;DR
This paper introduces a new framework and model for reinforcement learning from complex, non-Markovian delayed rewards, moving beyond traditional assumptions that rewards are simple sums of step rewards, and demonstrates improved performance on locomotion tasks.
Contribution
The paper proposes a novel modeling framework and a transformer-based architecture, CoDeTr, for handling composite delayed rewards without assuming Markovian properties, advancing RL reward modeling.
Findings
CoDeTr outperforms baseline methods on locomotion tasks.
It effectively identifies key steps contributing to delayed rewards.
It accurately predicts environment feedback from complex reward structures.
Abstract
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex…
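To make the distinction in the abstract concrete, the sketch below contrasts the traditional assumption (the delayed reward is the plain sum of instance-level rewards over a segment) with one hypothetical non-sum composite, a weighted sum in which only some steps matter. The function names, the weights, and the example segment are illustrative assumptions; they are not taken from the paper and do not represent CoDeTr or the paper's reward model.

```python
from typing import Sequence


def sum_delayed_reward(step_rewards: Sequence[float]) -> float:
    """Traditional assumption: delayed reward = sum of instance-level rewards."""
    return float(sum(step_rewards))


def weighted_delayed_reward(
    step_rewards: Sequence[float], weights: Sequence[float]
) -> float:
    """Hypothetical composite: each step contributes with its own weight.

    This is only one illustrative way a delayed reward could deviate from a
    plain sum; the weights here are made up for the example.
    """
    if len(step_rewards) != len(weights):
        raise ValueError("step_rewards and weights must have the same length")
    return float(sum(w * r for w, r in zip(weights, step_rewards)))


# A made-up segment of four instance-level rewards.
segment = [1.0, 0.0, 2.0, 1.0]

# Hypothetical evaluator that only cares about the third step.
key_step_weights = [0.0, 0.0, 1.0, 0.0]

sum_feedback = sum_delayed_reward(segment)
composite_feedback = weighted_delayed_reward(segment, key_step_weights)
```

Under the sum assumption every step counts equally, whereas in the weighted variant the feedback for the same segment depends on which steps the evaluator emphasizes, so a learner that assumes a plain sum would misattribute credit.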
Taxonomy
Topics: Innovation Diffusion and Forecasting · Supply Chain and Inventory Management
Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam
