Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, and Masashi Sugiyama

arXiv:2410.20176 · cs.LG · October 29, 2024

Open Access

TL;DR

This paper introduces a framework and a model for reinforcement learning from complex, non-Markovian delayed rewards, moving beyond the traditional assumption that a delayed reward is simply the sum of Markovian step rewards, and reports improved performance on locomotion tasks.

Contribution

The paper formulates the problem of RL from Composite Delayed Reward (RLCoDe) and proposes a transformer-based architecture, CoDeTr, that models composite delayed rewards without assuming underlying Markovian rewards.

Findings

01. CoDeTr outperforms baseline methods on locomotion tasks.

02. It effectively identifies key steps contributing to delayed rewards.

03. It accurately predicts environment feedback from complex reward structures.

Abstract

Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex…
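To make the distinction in the abstract concrete, the following minimal sketch contrasts the traditional sum-form delayed reward with one possible composite form. The softmax-weighted sum, the function names, and the `scores` argument are all assumptions chosen for illustration; they are not the paper's definition of a composite delayed reward or of CoDeTr.

```python
import math


def sum_delayed_reward(step_rewards):
    """Traditional assumption: the delayed reward is the plain sum of step rewards."""
    return sum(step_rewards)


def composite_delayed_reward(step_rewards, scores):
    """Illustrative composite (an assumption, not the paper's formula):
    a softmax-weighted sum of step rewards.

    `scores` are hypothetical per-step importance scores. Because each weight
    depends on every score in the sequence, the contribution of a single step
    cannot be computed from that step alone.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    reward = sum(w * r for w, r in zip(weights, step_rewards))
    return reward, weights


step_rewards = [1.0, 2.0, 3.0]
print(sum_delayed_reward(step_rewards))
# The evaluator is assumed to care mostly about the final step.
reward, weights = composite_delayed_reward(step_rewards, [0.0, 0.0, 10.0])
print(reward, weights)
```

With equal scores the composite reduces to the mean of the step rewards; with one dominant score it approaches the reward of that single step, which is the kind of sequence-level structure a plain sum cannot express.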

Taxonomy

Topics: Innovation Diffusion and Forecasting · Supply Chain and Inventory Management

Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam