Improving Large Language Model Planning with Action Sequence Similarity

Xinran Zhao; Hanie Sedghi; Bernd Bohnet; Dale Schuurmans; Azade Nova

arXiv:2505.01009·cs.AI·May 5, 2025

TL;DR

This paper introduces GRASE-DC, an exemplar selection method for in-context learning that chooses exemplars by action sequence similarity and improves large language model planning accuracy by up to 40 points across various tasks.

Contribution

It proposes a new exemplar sampling and filtering approach leveraging plan action sequence similarity, improving LLM planning accuracy and generalization.

Findings

01 · GRASE-DC improves planning accuracy by up to 40 points.

02 · It reduces the number of exemplars needed by 27.3%.

03 · Performance boosts are consistent across different LLMs and benchmarks.

Abstract

Planning is essential for artificial intelligence systems to look ahead and proactively determine a course of actions to reach objectives in the virtual and real world. Recent work on large language models (LLMs) sheds light on their planning capability in various tasks. However, it remains unclear what signals in the context influence the model performance. In this work, we explore how to improve the model planning capability through in-context learning (ICL), specifically, what signals can help select the exemplars. Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan side action sequence similarity (AS). We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The proposed approach is intuitive and shows good empirical performance: it selects exemplars by AS similarity, so the exemplars are of a similar type to the test task.
2. The empirical evaluation is extensive, spanning four PDDL tasks and a natural language planning task, and it tests the method across different base models, showcasing the robustness of the approach.

Weaknesses

1. The effectiveness of GRASE depends heavily on the quality of the initial plans that the LLM generates with randomly selected exemplars; poor initial plans can compromise AS-based exemplar selection.
2. For setups with validator access, a baseline comparison with rejection sampling could improve the analysis. For example, under a similar validator query budget, the validator can be used to reject invalid plans and select the better plan among those generated with random exemplars.
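The rejection-sampling baseline suggested in point 2 could look roughly like the sketch below. This is an illustration only, not something evaluated in the paper: `generate_plan` and `is_valid` are hypothetical callables standing in for an LLM call and a plan validator, and the budget is assumed to be counted as one validator query per candidate plan.

```python
import random
from typing import Callable, Optional, Sequence


def rejection_sample_plan(
    problem: str,
    exemplar_pool: Sequence[str],
    generate_plan: Callable[[str, Sequence[str]], str],  # hypothetical LLM wrapper
    is_valid: Callable[[str, str], bool],                # hypothetical validator wrapper
    num_exemplars: int,
    validator_budget: int,
    rng: random.Random,
) -> Optional[str]:
    """Draw random exemplars, generate a plan, and keep the first valid one.

    Returns None if no valid plan is found within the validator budget.
    """
    for _ in range(validator_budget):
        # Random exemplar selection: no similarity signal is used here.
        exemplars = rng.sample(list(exemplar_pool), num_exemplars)
        plan = generate_plan(problem, exemplars)
        # Each check consumes one unit of the validator budget.
        if is_valid(problem, plan):
            return plan
    return None
```

Holding `validator_budget` equal across methods is what would make such a comparison fair, since both approaches would then query the validator the same number of times.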

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- **Originality**: This paper has good originality in that it proposes to focus on action sequence similarity instead of the traditional criteria based on the semantic similarity between task descriptions when performing example selection for LLM in-context learning of planning tasks.
- **Quality**: This paper has high overall quality. Most of the steps in the proposed GRASE-DC pipeline are very clearly described and discussed in the methodology section, and Section 3 as well as the appendix als

Weaknesses

1. In the formula on Line 141, shouldn’t there be a ‘| |’ symbol around ‘LCAS(A_i, A_j)’? According to the previous description, LCAS(A_i, A_j) is a sequence, not a number.
2. The notation and definition of the core concept ‘Action Sequence Similarity’ should be defined more clearly and strictly in the paper. Currently some mentions of AS are a little vague and confusing.
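Point 1 turns on the difference between a common subsequence and its length. The sketch below illustrates that distinction under two stated assumptions: that LCAS denotes a longest common subsequence of two action sequences, computed here with the standard dynamic-programming algorithm, and that similarity is the LCAS length divided by the length of the longer plan. The normalization is a guess for illustration, since the paper's formula is not reproduced in this summary, and the function names are hypothetical.

```python
from typing import List, Sequence


def lcas(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of two action sequences."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the longest common subsequence of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover the subsequence itself.
    out: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def action_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """|LCAS(a, b)| / max(|a|, |b|); the normalization is an assumption."""
    if not a or not b:
        return 0.0
    return len(lcas(a, b)) / max(len(a), len(b))


plan_i = ["pick-up", "stack", "pick-up", "stack"]
plan_j = ["pick-up", "put-down", "pick-up", "stack"]
print(lcas(plan_i, plan_j))               # a sequence of actions
print(action_similarity(plan_i, plan_j))  # a number in [0, 1]
```

Here `lcas` returns a list of actions while `action_similarity` takes its length, which is the role the reviewer's suggested ‘| |’ would play in the formula.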

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- This paper breaks away from traditional task similarity and instead uses action sequence similarity to select exemplars for ICL, enhancing the model's planning performance with a simple yet effective method.
- GRASE-DC shows strong generalization performance on more complex tasks.
- The authors have made an effort to keep the exemplar selection process efficient.

Weaknesses

1. Using clustering algorithms to improve the relevance and diversity of the selected exemplars has been proposed before in another paper [1].
2. The selected evaluation datasets lack simulated real-world tasks such as ALFWorld, Mind2Web, ScienceWorld, etc.
3. Although the VAL mechanism is taken from another paper, the authors are encouraged to introduce it briefly to improve readability, since VAL appears frequently in the paper.

[1] Automatic Chain of Thought Pro

Videos

Improving Large Language Model Planning with Action Sequence Similarity · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling