Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment

Yizhuo Zhang; Heng Wang; Shangbin Feng; Zhaoxuan Tan; Xinyun Liu; Yulia Tsvetkov

arXiv:2506.00845·cs.LG·August 19, 2025

Open Access · 5 Reviews

TL;DR

This paper introduces a post-training alignment approach that helps LLMs generalize from synthetic graph data to real-world reasoning tasks with implicit graph structures, improving performance on 5 datasets by 12.9% on average.

Contribution

It proposes a novel post-training alignment method using algorithms like GRPO and DPO to enhance LLMs' ability to generalize from synthetic to real-world graph tasks.
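As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several responses sampled for the same prompt, so that each response is scored relative to its group. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; the paper's actual training setup is not shown in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Using the population std and this eps value are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same synthetic graph problem;
# 1.0 marks a correct final answer, 0.0 an incorrect one.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.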

Findings

1. Post-training alignment improves performance on 5 datasets by 12.9% on average.
2. Process-based rewards outperform solution-based rewards on synthetic data.
3. Challenges remain in compositionality and explainability of intermediate reasoning steps.

Abstract

Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While these led to specialized LLMs better at solving graph algorithm problems, we don't need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning of graph with post-training alignment with synthetic data. We first design solution-based and process-based rewards for synthetic graph problems: instead of rigid memorizing response patterns in direct fine-tuning, we posit that post-training alignment would help LLMs grasp the essentials underlying graph reasoning and alleviate overfitting on synthetic data. We employ post-training alignment algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on…
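To make the distinction between the two reward types concrete, here is a minimal sketch for a shortest-path problem on an unweighted, undirected graph. It is an illustration under stated assumptions, not the paper's reward definition: the function names, the binary solution reward, and the "fraction of valid hops" process reward are all hypothetical choices.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """BFS hop count between two nodes, or None if unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def solution_reward(edges, source, target, predicted_length):
    """Assumed solution-based reward: 1.0 only if the final answer is correct."""
    truth = shortest_path_length(edges, source, target)
    return 1.0 if predicted_length == truth else 0.0


def process_reward(edges, path):
    """Assumed process-based reward: share of hops that are real edges."""
    if len(path) < 2:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    hops = list(zip(path, path[1:]))
    return sum(frozenset(h) in edge_set for h in hops) / len(hops)


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
sol = solution_reward(edges, 0, 2, 2)       # final answer only
proc = process_reward(edges, [0, 1, 3, 2])  # hop (1, 3) is not an edge
```

The solution reward looks only at the final answer, while the process reward gives partial credit for each valid intermediate step, so a response with a hallucinated hop is penalized even if its final answer happens to be right.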

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The manuscript is clearly written with a well-motivated problem statement and strong readability.
2. The authors conduct extensive experiments for the proposed post-training alignment, covering both synthetic and real-world settings, with relatively comprehensive analysis dimensions.
3. The paper offers an in-depth diagnosis of LLM bottlenecks for graph reasoning (e.g., compositionality gaps, multi-step hallucination), which is insightful for future work.

Weaknesses

1. The methodological novelty primarily lies in the reward design and its application-oriented adaptation; overall, the work reads more like an engineering-focused instantiation of existing alignment algorithms (GRPO/DPO) to the graph-reasoning setting rather than a fundamentally new learning paradigm.
2. The experiments lack head-to-head comparisons with external SFT-based baselines (including representative methods cited in the introduction). Evaluations focus on intra-method stages (base/SFT/

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- I think the research question is interesting: how strong can the effect of synthetic post-training tasks be for more real-world reasoning tasks? This is an important question in general, and for graphs in particular. Importantly, the synthetic tasks are fairly easy to generate and verify.
- The finding on the relative insignificance of the process-based reward, despite its benefit for synthetic tasks, is interesting. Moreover, the findings on compositionality, i.e., that in many cases the

Weaknesses

- The writing of the main text could be clearer in certain parts. E.g., the two reward strategies should be explained, at least in a short verbal summary, in the main paper, including that each correct step gets a reward. That was clear to me only after looking at the appendix. Please also see my questions below. As another example, the relation between the 1-step and multi-step tasks was only clear after looking at the appendix.
- The novelty of the paper is not entirely clear. The evaluation

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The experiments are extensive, covering both synthetic and natural datasets, and the analyses demonstrate careful empirical rigor. The use of process-based rewards is commendable, as it reflects a shift from mere output correctness to reasoning faithfulness, a crucial step for trustworthy LLM reasoning.

Weaknesses

1. The evaluated graphs are small. Scalability to larger, common real-world networks is not demonstrated.
2. Not all tasks that could be modeled as graphs should be modeled as graphs. These tasks can already be addressed by LLMs, so why do we need to model them as graphs?
3. Although the paper claims to study "graph reasoning," it does not address any explicit graph problem in the conventional sense (e.g., shortest-path search, graph coloring, or subgraph matching). All evaluations are performed on lang

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. I agree with the claim that using LLMs to address tasks like shortest path is not meaningful. I'm glad that this paper addresses the issue.
2. The idea of using RL to extend an LLM's ability from small tasks to real-world applications is amazing.

Weaknesses

1. **Missing critical baselines and graph-specific LLM comparisons** The paper only compares against zero-shot and SFT baselines from the same model family. It omits comparisons with:
   - Graph-specialized LLMs cited in related work (G-Retriever, GraphWiz, and GraphRAG family for Multi-hop QA).
   - Strong prompting baselines (chain-of-thought with synthetic examples, PiVe's iterative verification)
   - Multi-task training jointly on synthetic+real-world data
   - NLGift with extended training as an SFT

Reviewer 05 · Rating 2 · Confidence 5

Strengths

The paper shows the generalization of graph-connectivity and shortest-path problems to real-world multi-hop reasoning tasks. It is an interesting direction, and the analysis of their relationships is worth exploring.

Weaknesses

1. "Unlocking generalizable graph reasoning by post-training alignment on synthetic data" is not a new idea, as already verified by previous papers [1,2].
2. The novelty of training methods (solution-based or process-based GRPO/DPO) is lacking.
3. The analysis of the relationship between synthetic graph reasoning and real-world multi-hop reasoning is lacking. The analysis in Section 5 cannot explain the experimental results in Section 4. (See details in Question 3)

[1]. G1: Teaching LLMs t

Taxonomy

Topics: Advanced Graph Neural Networks · Artificial Intelligence in Healthcare · Graph Theory and Algorithms

Methods: Direct Preference Optimization