Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning
Vikram Krishnamurthy, Luke Snow

TL;DR
This paper introduces a passive Langevin-based algorithm for adaptive inverse reinforcement learning that uses Malliavin calculus to estimate counterfactual gradients efficiently, avoiding the inefficiency of naive Monte Carlo estimation and the slow convergence of kernel smoothing.
Contribution
It develops a Malliavin calculus-based method to accurately and efficiently estimate counterfactual gradients in adaptive IRL, enabling improved passive learning algorithms.
Findings
The proposed method achieves standard estimation rates for counterfactual gradients.
It reformulates counterfactual conditioning as a ratio of unconditioned expectations.
The algorithm exploits Malliavin derivatives and Skorohod integrals for efficient gradient estimation.
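To make the "ratio of unconditioned expectations" idea concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm. It assumes a standard Brownian motion W and conditions on the probability-zero event W_t = x, using the classical Malliavin integration-by-parts identity E[f(W_T) | W_t = x] = E[f(W_T) H(W_t - x) S] / E[H(W_t - x) S], where H is the Heaviside step function and S = W_t/t - (W_T - W_t)/(T - t) is the Skorohod-integral weight for this special case. The function name, parameters, and test function f(y) = y^2 are illustrative assumptions.

```python
import random


def malliavin_conditional_expectation(f, x, t, T, n_samples=500_000, seed=0):
    """Estimate E[f(W_T) | W_t = x] for standard Brownian motion, 0 < t < T.

    Toy illustration only (an assumption, not the paper's setting): the
    conditional expectation is written as a ratio of two unconditioned
    expectations, each estimated by plain Monte Carlo.
    """
    rng = random.Random(seed)
    sqrt_t = t ** 0.5
    sqrt_dt = (T - t) ** 0.5
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        w_t = sqrt_t * rng.gauss(0.0, 1.0)    # W_t
        incr = sqrt_dt * rng.gauss(0.0, 1.0)  # W_T - W_t, independent of W_t
        if w_t > x:                           # Heaviside factor H(W_t - x)
            # Malliavin weight: Skorohod (here Ito) integral of
            # h_s = 1/t on [0, t] and -1/(T - t) on (t, T].
            weight = w_t / t - incr / (T - t)
            num += f(w_t + incr) * weight
            den += weight
    # The common 1/n_samples factor cancels in the ratio.
    return num / den


# For f(y) = y^2 the exact answer is x^2 + (T - t).
estimate = malliavin_conditional_expectation(lambda y: y * y, x=0.5, t=1.0, T=2.0)
exact = 0.5 ** 2 + (2.0 - 1.0)
print(estimate, exact)
```

Every simulated path contributes to both sums, whereas a naive estimator would have to wait for paths that hit W_t = x exactly, which happens with probability zero. Here the denominator is an estimate of the density of W_t at x, so no kernel bandwidth has to be chosen.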
Abstract
Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned…
