Fast Value Tracking for Deep Reinforcement Learning

Frank Shih; Faming Liang

arXiv:2403.13178·stat.ML·March 21, 2024·1 cites

Open Access 3 Reviews

TL;DR

This paper introduces LKTD, a scalable sampling algorithm based on Kalman filtering and SGMCMC, enabling uncertainty quantification in deep reinforcement learning for more robust decision-making.

Contribution

It presents a novel, scalable sampling method for deep RL that quantifies uncertainty and converges to a stationary distribution, enhancing robustness.

Findings

01

LKTD efficiently samples from the posterior of neural network parameters.

02

The algorithm's convergence to a stationary distribution is theoretically proven.

03

Uncertainty quantification improves policy robustness during training.

Abstract

Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment. However, existing algorithms often view these problems as static, focusing on point estimates for model parameters to maximize expected rewards, neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This…
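
To make the SGMCMC idea in the abstract concrete, here is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), one of the simplest SGMCMC samplers. It is not the LKTD algorithm. It assumes a toy problem invented for illustration: a single state whose value `theta` has a Gaussian prior, observed through noisy returns with known variance. The function names, the hyperparameters, and the toy model are all assumptions, not taken from the paper.

```python
import math
import random


def sgld_posterior_samples(returns, prior_var=10.0, noise_var=1.0,
                           step_size=1e-3, batch_size=10,
                           n_steps=20000, burn_in=5000, seed=0):
    """Draw approximate posterior samples of a scalar value with SGLD.

    Toy model (an assumption for illustration, not the paper's model):
        theta ~ N(0, prior_var),  return_i ~ N(theta, noise_var).
    """
    rng = random.Random(seed)
    n = len(returns)
    theta = 0.0
    samples = []
    for step in range(n_steps):
        batch = rng.sample(returns, batch_size)
        # Gradient of the log prior N(0, prior_var).
        grad_prior = -theta / prior_var
        # Minibatch estimate of the log-likelihood gradient,
        # rescaled by n / batch_size so it is unbiased for the full data set.
        grad_lik = (n / batch_size) * sum((g - theta) / noise_var for g in batch)
        # Langevin step: half a step along the gradient plus N(0, step_size) noise.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        if step >= burn_in:
            samples.append(theta)
    return samples


def exact_posterior(returns, prior_var=10.0, noise_var=1.0):
    """Closed-form Gaussian posterior (mean, variance) for the same toy model."""
    n = len(returns)
    precision = 1.0 / prior_var + n / noise_var
    mean = (sum(returns) / noise_var) / precision
    return mean, 1.0 / precision
```

Because the toy posterior is Gaussian, the average of the SGLD samples can be checked against `exact_posterior`. With a fixed step size and minibatch gradients the sample variance is somewhat inflated relative to the exact posterior variance, which is a known property of SGLD rather than a feature of this example.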

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

The paper introduces the limitations of existing reinforcement learning algorithms that overlook the stochastic nature of the agent-environment interaction system. To address this, the authors propose a novel algorithm called Langevinized Kalman Temporal-Difference (LKTD) that leverages the Kalman filtering paradigm to draw samples from the posterior distribution of deep neural network parameters. The LKTD algorithm allows for quantifying uncertainties associated with the value function and mode

Weaknesses

1. A comparison with existing posterior-sampling, value-based exploration algorithms is missing, for example ensemble sampling, Bootstrapped DQN, or HyperDQN.
2. It would be better to translate the theoretical guarantee into a regret bound.
3. A comparison of uncertainty quantification in deep neural networks is missing.
4. The empirical performance could be demonstrated on a wider range of benchmark problems, e.g. the Arcade Learning Environment benchmarks or the Behaviour Suite.

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

Existing RL algorithms often treat the value function as a deterministic entity, overlooking the system's inherent stochastic nature. The Kalman Temporal Difference (KTD) framework in RL treats the value function as a random variable, aiming to track the policy learning process by formulating RL as a state space model. However, KTD-based techniques employing linearization methods become inefficient with high-dimensional models. To improve the computational efficiency in deep RL, this paper refo
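
As a hedged illustration of the state space view the reviewer describes, the generic KTD formulation from the literature treats the value-function parameters $\theta_t$ as a hidden state and the observed reward $r_t$ as a noisy measurement. The symbols $Q$, $\sigma^2$, $\gamma$, and $\phi$ below are notation assumed here and may differ from the paper's:

$$
\theta_t = \theta_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q),
$$

$$
r_t = V_{\theta_t}(s_t) - \gamma V_{\theta_t}(s_{t+1}) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \sigma^2).
$$

For a linear value function $V_\theta(s) = \phi(s)^\top \theta$ the measurement is linear in $\theta_t$, with observation vector $\phi(s_t) - \gamma \phi(s_{t+1})$, so the ordinary Kalman filter applies exactly. For a deep network the measurement is nonlinear, which is where linearization-based approaches become costly.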

Weaknesses

1. Uncertainty quantification for deep RL is not restricted to Kalman TD frameworks. In the Bayesian RL literature, there are works that adopt other approximate sampling techniques, e.g. ensembling in Bootstrapped DQN (Osband et al., 2016), "Deep Exploration via Randomized Value Functions" (Osband et al., 2019), and RLSVI with general function approximation (Ishfaq et al., 2021). Although the above methods do not adopt MCMC methods as an approximate sampling scheme, could the authors compare the

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

This paper focuses on an important problem of RL: how to model the uncertainty during the interaction with the environment. The paper proposed a new sampling framework for RL. The paper also gives a systematic theoretical analysis of the convergence of the proposed algorithm under the general nonlinear setting.

Weaknesses

The writing of the paper is quite unclear to me. Some presentations need to be further clarified.
- I don't quite understand why the title of the paper is "fast value tracking". The term "tracking" is only mentioned in the introduction, without a clear explanation. Why is it "fast"?
- In eq. 2, the introduction of $\pi$ seems to lack motivation. Why can it tackle the issues suffered by KTD?

Some terms need further explanation and clarification.
- Please clarify what "stage" means.
- $x_t$ den

Taxonomy

Topics·Reinforcement Learning in Robotics