Learning from Less: Guiding Deep Reinforcement Learning with Differentiable Symbolic Planning
Zihan Ye, Oleg Arenz, Kristian Kersting

TL;DR
This paper introduces Dylan, a differentiable symbolic planner that integrates human priors into reinforcement learning, improving efficiency, exploration, and generalization in complex tasks.
Contribution
Dylan uniquely combines symbolic planning with RL to shape rewards and generate policies, enabling more efficient learning and better generalization.
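This summary does not say how Dylan turns a symbolic plan into a reward signal. As a rough illustration of plan-guided reward shaping in general, the sketch below uses classic potential-based shaping (Ng, Harada, and Russell, 1999), a technique swapped in here because it is known to leave the optimal policy unchanged. It is not the authors' method. The names `plan_potential` and `shaped_reward`, and the choice of "fraction of plan subgoals completed" as the potential, are hypothetical assumptions made for this example.

```python
def plan_potential(completed: int, plan_length: int) -> float:
    """Hypothetical potential: fraction of plan subgoals already achieved."""
    if plan_length <= 0:
        return 0.0
    return completed / plan_length


def shaped_reward(
    env_reward: float,
    completed_before: int,
    completed_after: int,
    plan_length: int,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).

    This is an assumed stand-in for a planner-derived reward,
    not Dylan's actual reward model.
    """
    phi_s = plan_potential(completed_before, plan_length)
    # Terminal states get zero potential so the shaping terms telescope
    # over an episode and do not add a net bonus.
    phi_next = 0.0 if done else plan_potential(completed_after, plan_length)
    return env_reward + gamma * phi_next - phi_s


if __name__ == "__main__":
    # Sparse environment reward of 0, but the agent completes the first
    # of three subgoals, so the shaped reward is positive.
    print(shaped_reward(0.0, completed_before=0, completed_after=1, plan_length=3))
```

With this form, a step that makes progress along the plan is rewarded immediately even when the environment reward is sparse, while a step that makes no progress receives a small negative shaping term whenever `gamma` is below 1.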
Findings
Dylan improves the performance of RL baselines such as PPO and A2C on MiniGrid tasks.
It enables RL agents to learn with fewer training interactions.
Dylan generalizes to unseen tasks and composes new behaviors without retraining.
Abstract
When tackling complex problems, humans naturally break them down into smaller, manageable subtasks and adjust their initial plans based on observations. For instance, if you want to make coffee at a friend's place, you might initially plan to grab coffee beans, go to the coffee machine, and pour them into the machine. Upon noticing that the machine is full, you would skip the initial steps and proceed directly to brewing. In stark contrast, state-of-the-art reinforcement learners, such as Proximal Policy Optimization (PPO), lack such prior knowledge and therefore require significantly more training steps to exhibit comparable adaptive behavior. Thus, a central research question arises: How can we enable reinforcement learning (RL) agents to have similar "human priors", allowing the agent to learn with fewer training interactions? To address this challenge, we propose…
Peer Reviews
Decision · Submitted to ICLR 2026
The core idea of integrating symbolic planning into RL through a differentiable framework is novel and promising. Experiments show that the proposed method is able to generalize to unseen tasks and compose new behaviors without retraining, which is an advantage over traditional RL approaches.
- [Theoretical analysis] Regarding Dylan as a reward model, it is necessary to provide a theoretical analysis to show that the new reward function doesn't change the optimization objective compared with pure PPO/A2C.
- [Single test domain] The experiments are currently limited to the MiniGrid environment suite. Expanding the evaluation to include more diverse and complex environments, especially those with continuous action spaces (for example, robotics manipulation tasks), would demonstrate th
The authors present convincing evidence that DYLAN can improve sample efficiency when used as an auxiliary task on top of existing RL methods. Both the standard and adaptive versions of DYLAN demonstrate a substantial performance boost over the baselines in the tasks shown (Figure 3, Figure 4, Table 1). The method appears to be robust to partial information and adaptive to different strategies, which are valuable properties for practical deployment. The plans generated by DYLAN can improve int
Though the results indicate that DYLAN can infer the underlying goal from expert demonstrations (Q6), I'm not sure if this necessarily supports the claim that DYLAN has a "unique ability to generalize from demonstrations, a property beyond the scope of existing RL or hierarchical approaches." The ability to infer a goal does not necessarily correspond to the ability to stitch trajectories together. Additionally, existing RL approaches do demonstrate stitching and generalization of demonstrations
This is a well-motivated paper that connects symbolic reasoning with RL in a meaningful way. The motivation is clear: RL struggles with sparse rewards, and symbolic priors can help address that. The method is technically consistent and reasonably well explained. Results on MiniGrid show faster learning and better stability compared to PPO and A2C. The planner’s ability to adapt search strategies (DFS/BFS), compose sub-tasks, and infer goals is impressive and shows flexibility.
- Missing a formal definition of the setting (e.g., how is a state and goal condition defined?)
- The practical impact is limited by the narrow scope of the experiment
- The experiments are limited to a small, discrete MiniGrid task; it’s unclear how well DYLAN would scale to continuous control or vision-based domains.
- Important baselines are missing, especially Reward Machines or intrinsic motivation approaches (RND, ICM), which deal with similar problems.
- There’s no discussion or measuremen
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Multimodal Machine Learning Applications
