Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations

William Sharpless; Dylan Hirsch; Sander Tonkens; Nikhil Shinde; Sylvia Herbert

arXiv:2506.16016·cs.AI·December 5, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces two novel Hamilton-Jacobi-Bellman-based value functions for dual-objective reinforcement learning. They enable explicit satisfaction of complex constraints without reward engineering, and the paper demonstrates their effectiveness with a new PPO variant.

Contribution

The paper extends Hamilton-Jacobi RL to dual objectives with explicit Bellman forms, providing a tractable approach to constrained RL problems.

Findings

01. Outperforms baselines in success and safety metrics

02. Produces distinct behaviors from previous methods

03. Effective in safe-arrival and multi-target tasks

Abstract

Hard constraints in reinforcement learning (RL) often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: 1) the Reach-Always-Avoid (RAA) problem -- of achieving distinct reward and penalty thresholds -- and 2) the Reach-Reach (RR) problem -- of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context via decomposition. Specifically, we prove that the RAA and RR problems may be rewritten as compositions of previously studied HJ-RL problems. We leverage our…
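To make the two problem statements concrete, here is a minimal Python sketch of how a single finite trajectory could be scored under each objective. The exact forms are an assumption inferred from the abstract's wording, not taken from the paper: RAA is scored as the best reward ever reached, capped by the worst safety margin ever seen (with the margin taken as the negated penalty), and RR as the smaller of the two best rewards. The function names `raa_score`, `rr_score`, and `raa_score_incremental` are hypothetical, and discounting is ignored.

```python
from typing import Sequence


def raa_score(rewards: Sequence[float], margins: Sequence[float]) -> float:
    """Assumed Reach-Always-Avoid score of one trajectory.

    High only if some step reaches a high reward AND every step keeps a
    high safety margin (margin = negated penalty, an assumption here).
    """
    return min(max(rewards), min(margins))


def rr_score(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Assumed Reach-Reach score: both rewards must be reached at some step,
    not necessarily the same one."""
    return min(max(rewards_a), max(rewards_b))


def raa_score_incremental(
    rewards: Sequence[float], margins: Sequence[float]
) -> float:
    """Same RAA score, computed step by step from two running statistics.

    The running max/min pair is the kind of history a state augmentation
    would need to carry; this is an illustration, not the paper's scheme.
    """
    best_reward = float("-inf")
    worst_margin = float("inf")
    for r, q in zip(rewards, margins):
        best_reward = max(best_reward, r)    # best reward so far
        worst_margin = min(worst_margin, q)  # worst safety margin so far
    return min(best_reward, worst_margin)


if __name__ == "__main__":
    rewards = [0.1, 0.7, 0.3]
    margins = [0.9, 0.4, 0.6]
    other_rewards = [0.2, 0.1, 0.5]

    print(raa_score(rewards, margins))              # 0.4
    print(raa_score_incremental(rewards, margins))  # 0.4
    print(rr_score(rewards, other_rewards))         # 0.5
```

A trajectory "succeeds" under this reading when its score clears the chosen threshold. The incremental version shows why a policy may need some memory of the past: the score depends on the best and worst values seen so far, not only on the current state.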

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- The RR and RAA problems are two interesting dual-objective satisfaction problems, and the paper reveals their unique structures that require specialized treatment.
- The paper presents a solid theoretical analysis by deriving the Bellman equations for the RR and RAA problems and elucidating their connections to the previously studied reach, avoid, and reach–avoid problems.
- Empirical results across multiple environments show that the proposed PPO-based method outperforms baseline approaches

Weaknesses

I do not have major concerns about the paper (possibly due to limited familiarity with the related literature). Some minor points are as follows:

- The paper considers a deterministic transition function. It is unclear whether the proposed method can be extended to the stochastic transition case.
- For stochastic policies, it remains uncertain whether the proposed PPO-based update can guarantee convergence to an optimal policy.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- **Originality and significance**: moderate. The problems studied in this paper are classical in logic terms, but the decomposition is novel and practically attractive.
- **Quality and correctness**: the reductions using min/max identities and distributivity/commutativity under deterministic transition and policy assumptions (relaxed later) seem sound and reasonable for the stated setting. Proofs were not carefully checked.
- **Clarity**: This paper is clear, easy to follow, and contextualized

Weaknesses

## Formal relation to temporal logic

I believe the value function decompositions are based on some algebra/logic implicitly. It would help readers if the paper states the specs more explicitly, potentially using temporal logic, even if the proposed method does not explicitly use automata. My feeling is that this paper is just doing quantitative semantics for temporal modal operators $F$ **f**inally, $G$ **g**lobally/always, and $U$ **u**ntil. Let $r, p, q$ be temporal propositions. Then,

- R

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The decompositions are stated cleanly, proved in the appendix, and tied directly to implementable Bellman operators; the need for history and the exact augmentation are justified, not hand-waved.
2. Many safe / task-spec RL problems really are “reach X and stay safe” or “reach X then Y”; having explicit Bellman forms and a working PPO variant for those is useful, especially since baselines mostly get only “partial success.”

Weaknesses

1. Assumptions are narrow: The main theory is for deterministic, finite MDPs, yet real tasks are stochastic/continuous. The paper only gives a heuristic stochastic variant (SRABE) without matching guarantees. A short discussion of what breaks in stochastic dynamics is needed.
2. State augmentation cost: The proposed augmentation grows the state with running max/min signals; for higher-dim tasks (real robots, multi-goal specs) this could be heavy, and the paper doesn’t study scalability.
3. The p

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Formal Methods in Verification