# Formal Policy Learning from Demonstrations for Reachability Properties

**Authors:** Hadi Ravanbakhsh, Sriram Sankaranarayanan, Sanjit A. Seshia

arXiv: 1903.00589 · 2019-03-05

## TL;DR

This paper learns closed-loop policies from demonstrations using a counterexample-guided loop between a learner, a demonstrator, and a verifier. The result is simple, formally verified policies for under-actuated robotic systems that run far faster than the MPC demonstrators they are learned from.

## Contribution

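It introduces an iterative learning method that combines demonstration with verification. Rather than fitting the demonstrator's actions by regression, it extends the MPC formulation with the gradient of the cost-to-go function at sample states, which constrains the set of policies compatible with the demonstrator's behavior.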

## Key findings

- The policy learning schemes are applied successfully to two case studies
- The learned policies are formally verified and many orders of magnitude faster to implement than the original MPCs
- Simple policies can be inferred from complex, unverified nonlinear MPC implementations

## Abstract

We consider the problem of learning structured, closed-loop policies (feedback laws) from demonstrations in order to control under-actuated robotic systems, so that formal behavioral specifications such as reaching a target set of states are satisfied. Our approach uses a "counterexample-guided" iterative loop that involves the interaction between a policy learner, a demonstrator and a verifier. The learner is responsible for querying the demonstrator in order to obtain the training data to guide the construction of a policy candidate. This candidate is analyzed by the verifier and either accepted as correct, or rejected with a counterexample. In the latter case, the counterexample is used to update the training data and further refine the policy. The approach is instantiated using receding horizon model-predictive controllers (MPCs) as demonstrators. Rather than using regression to fit a policy to the demonstrator actions, we extend the MPC formulation with the gradient of the cost-to-go function evaluated at sample states in order to constrain the set of policies compatible with the behavior of the demonstrator. We demonstrate the successful application of the resulting policy learning schemes on two case studies and we show how simple, formally verified policies can be inferred starting from complex and unverified nonlinear MPC implementations. As a further benefit, the policies are many orders of magnitude faster to implement when compared to the original MPCs.
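The learner/demonstrator/verifier interaction described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar system `x' = x + DT * (A * x + u)`, the policy class `u = -theta * x`, the fixed-gain `demonstrator` standing in for an MPC, and all names and constants. Two simplifications deviate from the paper and should be read as such: the learner here uses plain least-squares regression (the paper explicitly avoids regression in favor of cost-to-go gradient constraints), and the "verifier" is a grid-based simulation check, not a formal verifier.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): an unstable scalar system.
DT, A = 0.1, 1.0          # time step and open-loop growth rate
HORIZON, TARGET = 200, 0.05  # step budget and target set |x| <= TARGET


def step(x, u):
    """One Euler step of the toy dynamics x' = x + DT * (A * x + u)."""
    return x + DT * (A * x + u)


def demonstrator(x):
    """Stand-in for an MPC demonstrator: returns a control for state x."""
    return -3.0 * x


def learn(data):
    """Fit theta in u = -theta * x by least squares (a swapped-in technique)."""
    if not data:
        return 0.0  # initial candidate before any demonstrations
    xs = np.array([x for x, _ in data])
    us = np.array([u for _, u in data])
    return float(-(xs @ us) / (xs @ xs))


def reaches_target(theta, x0):
    """Simulate the closed loop and report whether the target set is reached."""
    x = x0
    for _ in range(HORIZON):
        if abs(x) <= TARGET:
            return True
        x = step(x, -theta * x)
    return abs(x) <= TARGET


def verify(theta):
    """Grid-based check (NOT formal verification): return a failing x0 or None."""
    for x0 in np.linspace(-1.0, 1.0, 21):
        if not reaches_target(theta, float(x0)):
            return float(x0)
    return None


def counterexample_guided_loop(max_iters=10):
    """Alternate learning and checking, adding demonstrations at counterexamples."""
    data = []
    for iteration in range(1, max_iters + 1):
        theta = learn(data)
        cex = verify(theta)
        if cex is None:
            return theta, iteration  # candidate accepted
        data.append((cex, demonstrator(cex)))  # query demonstrator at the failure
    raise RuntimeError("no accepted policy within the iteration budget")


theta, iterations = counterexample_guided_loop()
print(f"accepted theta={theta} after {iterations} iterations")
```

In this toy run the initial candidate fails at a boundary state, the demonstrator is queried there, and the refitted gain passes the check. The structure of the loop is the point; a faithful instantiation would replace both `learn` and `verify` with the paper's constraint-based learner and a sound verifier.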

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.00589/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1903.00589/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/1903.00589/full.md

---
Source: https://tomesphere.com/paper/1903.00589