TL;DR
This paper empirically investigates whether Monte Carlo methods can achieve experience stitching in reinforcement learning with function approximation, challenging the traditional view that only temporal difference methods are capable of this.
Contribution
The study demonstrates that Monte Carlo methods can also perform experience stitching, especially with larger neural networks, reducing the gap traditionally attributed to TD methods.
Findings
MC methods can achieve experience stitching with function approximation
Increasing critic capacity reduces the generalization gap for both MC and TD methods
The advantage of TD methods over MC diminishes with larger models
Abstract
Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed as the purview of temporal difference (TD) methods. However, outside of small tabular settings, trajectories never intersect, calling into question this conventional wisdom. Moreover, the common belief is that Monte Carlo (MC) methods should not be able to recombine experience, yet it remains unclear whether function approximation could result in a form of implicit stitching. The goal of this paper is to empirically study whether the conventional wisdom about stitching actually holds in settings where function approximation is used. We empirically demonstrate that Monte Carlo (MC) methods can also achieve experience stitching. While TD methods do achieve slightly stronger capabilities than MC methods…
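The tabular intuition the abstract refers to can be shown with a toy sketch. This is not the paper's environment or algorithm: the states `A`, `B`, `G`, the reward of 1 at the goal, the discount `GAMMA`, and the functions `mc_values` and `td_values` are all hypothetical and chosen only for illustration. The dataset holds two fragments, `A -> B` and `B -> G`, and never a full `A -> G` trajectory.

```python
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor, for illustration only

# Each trajectory is a list of (state, reward, next_state, terminal) transitions.
TRAJECTORIES = [
    [("A", 0.0, "B", False)],  # fragment 1: A -> B, cut off before any reward
    [("B", 1.0, "G", True)],   # fragment 2: B -> goal G, reward 1 on arrival
]

def mc_values(trajectories, gamma=GAMMA):
    """Average the discounted return observed after each state, per trajectory."""
    returns = defaultdict(list)
    for traj in trajectories:
        g = 0.0
        # Walk backwards so g accumulates the return from each step onward.
        for state, reward, _next_state, _terminal in reversed(traj):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def td_values(trajectories, gamma=GAMMA, sweeps=100, lr=0.5):
    """Tabular TD(0): bootstrap each state's value from its successor's value."""
    v = defaultdict(float)
    for _ in range(sweeps):
        for traj in trajectories:
            for state, reward, next_state, terminal in traj:
                bootstrap = 0.0 if terminal else gamma * v[next_state]
                v[state] += lr * (reward + bootstrap - v[state])
    return dict(v)

mc = mc_values(TRAJECTORIES)
td = td_values(TRAJECTORIES)
print("MC:", mc)  # A only sees the return of its own fragment
print("TD:", td)  # A inherits value from B through bootstrapping
```

In this sketch the MC estimate for `A` stays at zero because no single trajectory links `A` to the reward, while TD(0) passes value from `B` back to `A` through the shared state. The two fragments meet only because `B` appears in both. The abstract's point is that outside small tabular settings trajectories never intersect in this way, so any stitching has to come from generalization in the function approximator.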
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper is well-motivated. While TD is often contrasted with MC by this ability to stitch, such intuitions are presented in tabular settings. The observation that generalization is necessary for this to happen with function approximation is less often discussed.
* The paper provides a useful classification of stitching regimes: no stitching, exact stitching, and generalized stitching.
* They further detail an environment setup and how start/goal states can be configured to systematically test
* The evidence of MC stitching was notable in the generalized stitching regime, but surprisingly much less prevalent in the exact stitching case. This seems odd: why would the setup that cleanly arranges trajectories for stitching lead to worse stitching ability?
* Assuming GCDQN (MC) is a sound algorithm, Section 5.2 includes no comparison between GCDQN (TD) and GCDQN (MC), even though it is used in Section 5.3. It feels like it would be a fairer and more convincing comparison b
1. The authors tackle an important problem: the stitching performance of MC and TD methods. The argument that, when the critic is large, there is no significant gap in stitching performance between TD and MC methods is new and will be helpful to the community.
2. The authors try to formalize the concept of stitching, and the experiments are soundly constructed. The designed examples clearly represent the three different stitching scenarios that the authors consider.
1. It is not clear why the three scenarios (exact stitching, no stitching, generalized stitching) are representative for evaluating stitching performance.
2. Structure of presentation: The related work and preliminaries sections seem unbalanced. The preliminaries overlap with the related work, which leaves the preliminaries too short. The authors could provide more detail on their setting. For example, the replay buffer $\mathcal{D}$ is loosely defined, a
* This paper explains and discusses stitching as a specific way RL algorithms can generalize to new problems. This investigation is novel and interesting to the goal-conditioned RL and Skill and Options communities.
* The writing is coherent and easy to follow, with good use of diagrams to discuss the details of experiments.
* The introduced testbed is clearly defined, and its design choices and details are explained and justified.
* Discussions around different variants of stitching and how ea
* Experiments suffer from several poor empirical practices, which do not allow the reader to fully evaluate the findings.
* Untuned hyperparameters: all algorithms share the same hyperparameters, and there is no description of how they are selected. This leads me to believe that they are not tuned for this new problem. The performance of untuned algorithms can vary greatly on new problems (Patterson et al., 2023). This fact alone is a major obstacle to trusting the outcome of experiments.
*
