On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL

Valerio Belcamino; Nicholas Attolino; Alessio Capitanelli; Fulvio Mastrogiovanni

arXiv:2601.14456·cs.AI·January 22, 2026

TL;DR

This paper investigates the generalization gap in large language models on planning tasks, showing that fine-tuned models rely heavily on domain-specific patterns and fail to transfer planning knowledge to unseen domains.

Contribution

The study introduces three diagnostic interventions (instance-wise symbol anonymization, compact plan serialization, and verifier-reward fine-tuning) to analyze LLM planning failures, and shows that current fine-tuning approaches do not achieve cross-domain generalization.

Findings

01 In-domain valid plan rate reaches 82.9%

02 Valid plan rate drops to 0% on two unseen domains

03 Verifier-reward fine-tuning does not improve cross-domain transfer

Abstract

Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches 82.9% valid plan rate in in-domain conditions, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, thus…
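To make intervention (i) concrete, here is a minimal sketch of what instance-wise symbol anonymization could look like. The abstract does not describe the authors' actual scheme, so everything here is an assumption for illustration: the function name `anonymize_instance`, the `sym<N>` token format, and the requirement that the caller supplies the list of symbols to rename. The idea shown is only that each instance gets its own consistent renaming, which changes surface names while leaving plan structure intact.

```python
import random
import re


def anonymize_instance(texts, symbols, seed):
    """Rename symbols to opaque tokens, consistently across one instance.

    Hypothetical helper: `texts` holds the PDDL strings of one instance
    (e.g. problem and plan), `symbols` the names to hide, and `seed`
    selects the per-instance renaming.
    """
    rng = random.Random(seed)
    ids = list(range(len(symbols)))
    rng.shuffle(ids)
    mapping = {name: f"sym{idx}" for name, idx in zip(symbols, ids)}
    # Match whole PDDL identifiers only; they may contain '-' and '_'.
    alternatives = "|".join(
        re.escape(s) for s in sorted(symbols, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    # A single substitution pass avoids renaming an already renamed token.
    renamed = [pattern.sub(lambda m: mapping[m.group(1)], t) for t in texts]
    return renamed, mapping


# Toy Blocksworld-style fragment, invented for this example.
problem = "(on block-a block-b) (clear block-a)"
plan = "(unstack block-a block-b)"
symbols = ["on", "clear", "unstack", "block-a", "block-b"]

renamed, mapping = anonymize_instance([problem, plan], symbols, seed=0)
print(renamed)
```

Because the mapping is a bijection applied uniformly to the problem and the plan, a plan that is valid before renaming stays valid after it; only the surface vocabulary changes.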

Taxonomy

Topics: AI-based Problem Solving and Planning · Machine Learning in Healthcare · Topic Modeling