Learning from Less: Guiding Deep Reinforcement Learning with Differentiable Symbolic Planning

Zihan Ye; Oleg Arenz; Kristian Kersting

arXiv:2505.11661·cs.AI·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Dylan, a differentiable symbolic planner that integrates human priors into reinforcement learning, improving efficiency, exploration, and generalization in complex tasks.

Contribution

Dylan combines differentiable symbolic planning with RL, both to shape rewards and to generate policies, with the aim of more sample-efficient learning and better generalization.
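As a rough illustration of how a symbolic plan can shape rewards, the sketch below uses potential-based reward shaping, a standard technique that leaves the optimal policy unchanged. It is not taken from the paper: whether Dylan's reward has this form is not stated here, and the `State` type, the predicate names, and the helpers `plan_potential` and `shaped_reward` are all hypothetical.

```python
from typing import Callable, FrozenSet, Sequence

# Assumption: a state is the set of symbolic facts that currently hold.
State = FrozenSet[str]


def plan_potential(plan: Sequence[str]) -> Callable[[State], float]:
    """Potential = number of leading plan subgoals already achieved."""
    def phi(state: State) -> float:
        done = 0
        for subgoal in plan:
            if subgoal not in state:
                break  # progress is counted in plan order only
            done += 1
        return float(done)
    return phi


def shaped_reward(env_reward: float, state: State, next_state: State,
                  phi: Callable[[State], float], gamma: float = 0.99) -> float:
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return env_reward + gamma * phi(next_state) - phi(state)


# Hypothetical three-step plan for a door-and-key style grid task.
plan = ["has_key", "door_open", "at_goal"]
phi = plan_potential(plan)

# A trajectory that follows the plan; only the last step has env reward.
trajectory = [
    frozenset(),
    frozenset({"has_key"}),
    frozenset({"has_key", "door_open"}),
    frozenset({"has_key", "door_open", "at_goal"}),
]
env_rewards = [0.0, 0.0, 1.0]
gamma = 0.99

shaped = [shaped_reward(r, s, s_next, phi, gamma)
          for r, s, s_next in zip(env_rewards, trajectory, trajectory[1:])]
```

Under this scheme the agent receives a dense signal at every plan step even though the environment reward is sparse. The shaping terms telescope along any trajectory, so the shaped return differs from the original return only by a term that depends on the start and end states.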

Findings

01

Dylan improves RL performance across various tasks.

02

It enables RL agents to learn with fewer interactions.

03

Dylan enhances generalization to unseen tasks.

Abstract

When tackling complex problems, humans naturally break them down into smaller, manageable subtasks and adjust their initial plans based on observations. For instance, if you want to make coffee at a friend's place, you might initially plan to grab coffee beans, go to the coffee machine, and pour them into the machine. Upon noticing that the machine is full, you would skip the initial steps and proceed directly to brewing. In stark contrast, state-of-the-art reinforcement learners, such as Proximal Policy Optimization (PPO), lack such prior knowledge and therefore require significantly more training steps to exhibit comparable adaptive behavior. Thus, a central research question arises: How can we enable reinforcement learning (RL) agents to have similar "human priors", allowing the agent to learn with fewer training interactions? To address this challenge, we propose…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The core idea of integrating symbolic planning into RL through a differentiable framework is novel and promising. Experiments show that the proposed method is able to generalize to unseen tasks and compose new behaviors without retraining, which is an advantage over traditional RL approaches.

Weaknesses

- [Theoretical analysis] Regarding Dylan as a reward model, it is necessary to provide a theoretical analysis to show that the new reward function doesn't change the optimization objective compared with pure PPO/A2C.
- [Single test domain] The experiments are currently limited to the MiniGrid environment suite. Expanding the evaluation to include more diverse and complex environments, especially those with continuous action spaces (for example, robotics manipulation tasks), would demonstrate th

Reviewer 02 · Rating 4 · Confidence 2

Strengths

The authors present convincing evidence that DYLAN can improve sample efficiency when used as an auxiliary task on top of existing RL methods. Both the standard and adaptive versions of DYLAN demonstrate a substantial performance boost over the baselines in the tasks shown (Figure 3, Figure 4, Table 1). The method appears to be robust to partial information and adaptive to different strategies, which are valuable properties for practical deployment. The plans generated by DYLAN can improve int

Weaknesses

Though the results indicate that DYLAN can infer the underlying goal from expert demonstrations (Q6), I'm not sure if this necessarily supports the claim that DYLAN has a "unique ability to generalize from demonstrations, a property beyond the scope of existing RL or hierarchical approaches." The ability to infer a goal does not necessarily correspond to the ability to stitch trajectories together. Additionally, existing RL approaches do demonstrate stitching and generalization of demonstrations

Reviewer 03 · Rating 4 · Confidence 3

Strengths

This is a well-motivated paper that connects symbolic reasoning with RL in a meaningful way. The motivation is clear: RL struggles with sparse rewards, and symbolic priors can help address that. The method is technically consistent and reasonably well explained. Results on MiniGrid show faster learning and better stability compared to PPO and A2C. The planner’s ability to adapt search strategies (DFS/BFS), compose sub-tasks, and infer goals is impressive and shows flexibility.

Weaknesses

- Missing a formal definition of the setting (e.g., how is a state and goal condition defined?)
- The practical impact is limited by the narrow scope of the experiment
- The experiments are limited to a small, discrete MiniGrid task; it’s unclear how well DYLAN would scale to continuous control or vision-based domains.
- Important baselines are missing, especially Reward Machines or intrinsic motivation approaches (RND, ICM), which deal with similar problems.
- There’s no discussion or measuremen

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Multimodal Machine Learning Applications