ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
Fanxing Li, Fangyu Sun, Tianbao Zhang, Danping Zou

TL;DR
ABPT is a novel method that improves quadrotor control training by reducing gradient bias in partially differentiable reward settings, leading to faster convergence and higher rewards.
Contribution
ABPT introduces a new approach combining 0-step and N-step returns to mitigate gradient bias in BPTT with partially differentiable rewards.
Findings
ABPT converges faster than existing algorithms.
ABPT achieves higher rewards in quadrotor tasks.
ABPT is effective in both real-world and simulation environments.
Abstract
Quadrotor control policies can be trained with high performance using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT). However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks in both…
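As a rough illustration of the phrase "combines 0-step and N-step returns", the sketch below computes the two quantities as plain scalars and mixes them. The equal-weight default, the function names, and the `weight` parameter are assumptions made for illustration; the abstract does not state how the two returns are weighted. In an actual BPTT setup the rewards and values would be differentiable tensors and gradients would flow through both terms, which this scalar version does not show.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], terminal_value: float, gamma: float) -> float:
    """Discounted sum of N rewards plus a bootstrapped value at step N."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total + (gamma ** len(rewards)) * terminal_value


def combined_return(
    q0: float,
    rewards: Sequence[float],
    terminal_value: float,
    gamma: float = 0.99,
    weight: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Mix the 0-step return (the critic's Q estimate at the first step)
    with the N-step return computed from a simulated rollout."""
    return weight * q0 + (1.0 - weight) * n_step_return(rewards, terminal_value, gamma)


# Toy numbers: two rewards of 1.0, gamma = 0.5, no terminal value, Q estimate 2.0.
example = combined_return(q0=2.0, rewards=[1.0, 1.0], terminal_value=0.0, gamma=0.5)
```

With these toy numbers the N-step return is 1.0 + 0.5 = 1.5, so the equal-weight mix with the Q estimate of 2.0 gives 1.75.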
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper addresses an important problem of non-differentiable rewards, which are very common in real-life applications. 2. The modification uses components already present in actor-critic algorithms and thus does not require training additional networks, as far as I can understand. However, the cost of training the existing components can change (see below).
1. The paper needs experiments to show that adding the 0-step return indeed decreases the "bias" of the gradient, as this is one of the central claims of the paper. One way to do this would be to take a very sharp but differentiable relaxation of the non-differentiable reward components and then run a Monte Carlo study that empirically analyzes the distance between the gradients from this relaxed objective and (i) the ABPT N-step + 0-step return and (ii) only the N-step return. 2. The paper needs to go in mor
1. The motivation and problem of learning under non-differentiable rewards is highly relevant for real-world robotic control. 2. The idea of combining 0-step and N-step returns with a replay buffer is novel and potentially valuable. 3. Simulation results are promising and suggest the method can stabilize learning in challenging reward environments.
However, several key concerns remain that prevent the paper from making a strong contribution to the field. The following points are meant to guide a revised version of the work: 1. **Reward Function Design and Realism:** While the paper focuses on non-differentiable rewards (e.g., binary success/failure), in practice, quadrotor control tasks typically employ hybrid reward functions - combining sparse binary signals with dense, differentiable terms (e.g., position error, angular velocity penalt
1. Originality - The paper does combine several known techniques, but its originality lies in two areas. First, the problem formulation itself, which is framed as a biased-gradient problem. Second, the paper makes clear how previous FOG methods such as SHAC, SAPO, and AHAC, which incorporate a terminal-only value function, fail to prevent gradient bias, whereas the 0-step value gradient of ABPT does. 2. Quality - The quality of the paper is quite high, with experimental design being its stron
1. The paper's most critical weakness is a fundamental contradiction between its premise and its own ablation study. The core hypothesis is that the 0-step Q-gradient (QG), from a "well trained" critic, "amends" the biased First-Order-Gradient (FOG). However, the ablation study (Figure 6) and the authors' own analysis state that for the 'Racing' task, the critic is "under-fitting" and "deteriorates" training, and that entropy stabilization is the dominant factor for performance, not the 0-step
The problem statement is clear and well-motivated. Although contact dynamics are a more common source of non-differentiability in policy optimization, "indicator function" type rewards are also important. The algorithm idea is natural and is worth exploring. The extra experiments in the appendix C give some nice fine-grained detail into the influence of hyperparameters and reward function design in the quadrotor settings.
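To make the "indicator function" reward issue concrete, the following minimal sketch compares a binary success reward with a sharp sigmoid relaxation of it. The function names, the success radius of 0.5, and the sharpness of 20 are illustrative assumptions and do not come from the paper or the reviews. Away from the threshold the indicator contributes exactly zero gradient, so a first-order method sees no learning signal from that reward term, while the relaxed version still yields a small nonzero slope.

```python
import math
from typing import Callable


def indicator_reward(distance: float, radius: float = 0.5) -> float:
    """Binary success reward: 1 inside the goal radius, 0 outside."""
    return 1.0 if distance < radius else 0.0


def relaxed_reward(distance: float, radius: float = 0.5, sharpness: float = 20.0) -> float:
    """Sigmoid relaxation of the indicator; approaches it as sharpness grows."""
    return 1.0 / (1.0 + math.exp(sharpness * (distance - radius)))


def finite_difference(f: Callable[[float], float], x: float, eps: float = 1e-4) -> float:
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Evaluate both gradients at a point outside the goal region.
grad_indicator = finite_difference(indicator_reward, 0.8)
grad_relaxed = finite_difference(relaxed_reward, 0.8)
```

Here `grad_indicator` is exactly zero, whereas `grad_relaxed` is negative, meaning the relaxed reward increases as the distance shrinks.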
- Line 128-129: Makes it sound like the gradient explosion/vanishing and instability are direct consequences of the dynamics being smooth. That is not true - the root of the problem is BPTT itself, i.e. recursive application of the dynamics function. Both smooth and nonsmooth dynamics can have those problems. - Equation (1): A reader not already familiar with advantage functions and policy gradients will not be able to understand $A^{\pi_\theta}$. - Line 169-170: This estimator is essentially De
Taxonomy
Topics: Distributed systems and fault tolerance · Quantum Computing Algorithms and Architecture · Quantum Mechanics and Applications
Methods: Entropy Regularization
