Expressive Temporal Specifications for Reward Monitoring
Omar Adalat, Francesco Belardinelli

TL;DR
This paper introduces a method using quantitative Linear Temporal Logic to create reward monitors that provide dense, nuanced feedback during reinforcement learning, improving training efficiency and task performance.
Contribution
It proposes a novel, logic-based framework for synthesizing reward monitors that handle non-Markovian properties and dense reward signals, surpassing Boolean monitors.
Findings
Quantitative monitors outperform Boolean monitors in task completion.
They reduce convergence time in reinforcement learning.
The approach is compatible with various algorithms and environments.
Abstract
Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (()) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsReinforcement Learning in Robotics · Formal Methods in Verification · Advanced Software Engineering Methodologies
