Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations
William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Shinde, Sylvia Herbert

TL;DR
This paper introduces two novel Hamilton-Jacobi-Bellman based value functions for dual-objective reinforcement learning, enabling explicit satisfaction of complex constraints without reward engineering, and demonstrates their effectiveness with a new PPO variant.
Contribution
It extends Hamilton-Jacobi RL to handle dual objectives with explicit Bellman forms, providing a tractable approach for constrained RL problems.
Findings
Outperforms baselines in success and safety metrics
Produces distinct behaviors from previous methods
Effective in safe-arrival and multi-target tasks
Abstract
Hard constraints in reinforcement learning (RL) often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: 1) the Reach-Always-Avoid (RAA) problem -- of achieving distinct reward and penalty thresholds -- and 2) the Reach-Reach (RR) problem -- of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context via decomposition. Specifically, we prove that the RAA and RR problems may be rewritten as compositions of previously studied HJ-RL problems. We leverage our…
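To make the two objectives concrete, here is a minimal sketch of how RAA and RR outcomes might be scored on a single finite trajectory. It rests on assumptions that this excerpt does not confirm: the RAA score is taken as the smaller of the best reward reached and the worst safety margin, with the safety margin defined as the negated penalty, and the RR score as the smaller of the two best rewards reached. The names `raa_value`, `rr_value`, and `running_extrema` are hypothetical, and the paper's exact definitions, discounting, and Bellman forms may differ.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def raa_value(rewards: Sequence[float], penalties: Sequence[float]) -> float:
    """Trajectory-level Reach-Always-Avoid score (assumed form).

    High only if some step reaches a high reward AND every step keeps
    the penalty low. Safety margin = -penalty is an assumed convention.
    """
    best_reward = max(rewards)                  # "reach": best reward seen
    worst_margin = min(-p for p in penalties)   # "always avoid": worst margin
    return min(best_reward, worst_margin)


def rr_value(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Trajectory-level Reach-Reach score (assumed form).

    High only if both rewards are high at some step, not necessarily
    the same step.
    """
    return min(max(rewards_a), max(rewards_b))


def running_extrema(
    rewards: Sequence[float], penalties: Sequence[float]
) -> List[Tuple[float, float]]:
    """Per-step (running max reward, running min safety margin).

    Illustrates the kind of history signal a state could be augmented
    with; this is not the paper's augmentation.
    """
    run_max = accumulate(rewards, max)
    run_min = accumulate((-p for p in penalties), min)
    return list(zip(run_max, run_min))


if __name__ == "__main__":
    r = [0.1, 0.9, 0.3]
    p = [0.2, 0.5, 0.1]
    print(raa_value(r, p))
    print(rr_value([0.1, 0.9], [0.4, 0.2]))
    print(running_extrema(r, p))
```

Under these assumptions, comparing either score to a chosen threshold gives a pass/fail reading of the trajectory. The running extrema hint at why a policy may need history: whether a threshold has already been met depends on past steps, not only on the current state.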
Peer Reviews
Decision·ICLR 2026 Poster
- The RR and RAA problems are two interesting dual-objective satisfaction problems, and the paper reveals their unique structures that require specialized treatment.
- The paper presents a solid theoretical analysis by deriving the Bellman equations for the RR and RAA problems and elucidating their connections to the previously studied reach, avoid, and reach–avoid problems.
- Empirical results across multiple environments show that the proposed PPO-based method outperforms baseline approaches
I do not have major concerns about the paper (possibly due to limited familiarity with the related literature). Some minor points are as follows:
- The paper considers a deterministic transition function. It is unclear whether the proposed method can be extended to the stochastic transition case.
- For stochastic policies, it remains uncertain whether the proposed PPO-based update can guarantee convergence to an optimal policy.
- **Originality and significance**: moderate. The problems studied in this paper are classical in logic terms, but the decomposition is novel and practically attractive.
- **Quality and correctness**: the reductions using min/max identities and distributivity/commutativity under deterministic transition and policy assumptions (relaxed later) seem sound and reasonable for the stated setting. Proofs were not carefully checked.
- **Clarity**: This paper is clear, easy to follow, and contextualized
## Formal relation to temporal logic
I believe the value function decompositions are based on some algebra/logic implicitly. It would help readers if the paper states the specs more explicitly, potentially using temporal logic, even if the proposed method does not explicitly use automata. My feeling is that this paper is just doing quantitative semantics for temporal modal operators $F$ **f**inally, $G$ **g**lobally/always, and $U$ **u**ntil. Let $r, p, q$ be temporal propositions. Then,
- R
1. The decompositions are stated cleanly, proved in the appendix, and tied directly to implementable Bellman operators; the need for history and the exact augmentation are justified, not hand-waved.
2. Many safe / task-spec RL problems really are “reach X and stay safe” or “reach X then Y”; having explicit Bellman forms and a working PPO variant for those is useful, especially since baselines mostly get only “partial success.”
1. Assumptions are narrow: Main theory is for deterministic, finite MDPs, yet real tasks are stochastic/continuous, but the paper only gives a heuristic stochastic variant (SRABE) without matching guarantees. A short discussion of what breaks in stochastic dynamics is needed.
2. State augmentation cost: The proposed augmentation grows the state with running max/min signals; for higher-dim tasks (real robots, multi-goal specs) this could be heavy, and the paper doesn’t study scalability.
3. The p
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Formal Methods in Verification
