Policy-Based Trajectory Clustering in Offline Reinforcement Learning
Hao Hu, Xinqi Wang, Simon Shaolei Du

TL;DR
This paper introduces a new approach for clustering offline RL trajectories based on underlying policies, proposing two methods that effectively identify meaningful trajectory groups with theoretical guarantees and practical validation.
Contribution
It presents novel policy-based clustering methods, PG-Kmeans and CAAE, with theoretical convergence proofs and demonstrated effectiveness on standard datasets.
Findings
Both methods successfully cluster trajectories into meaningful groups.
Theoretical proof of finite-step convergence for PG-Kmeans.
Effective performance validated on D4RL and GridWorld environments.
Abstract
We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories. By leveraging the connection between the KL-divergence of offline trajectory distributions and a mixture of policy-induced distributions, we formulate a natural clustering objective. To solve this, we propose Policy-Guided K-means (PG-Kmeans) and Centroid-Attracted Autoencoder (CAAE). PG-Kmeans iteratively trains behavior cloning (BC) policies and assigns trajectories based on policy generation probabilities, while CAAE resembles the VQ-VAE framework by guiding the latent representations of trajectories toward the vicinity of specific codebook entries to achieve clustering. Theoretically, we prove the finite-step convergence of PG-Kmeans and identify a key challenge in offline trajectory clustering: the…
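The PG-Kmeans loop described in the abstract (fit one behavior cloning policy per cluster, then reassign each trajectory to the policy most likely to have generated it) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation. It assumes a tabular setting with discrete states and actions, uses Laplace-smoothed action counts as a stand-in for a trained BC policy, and the names `fit_bc_policy`, `traj_log_prob`, `pg_kmeans`, `alpha`, `n_iters`, and `init_labels` are all hypothetical.

```python
import numpy as np

def fit_bc_policy(trajs, n_states, n_actions, alpha=1.0):
    """Tabular stand-in for behavior cloning (an assumption of this sketch):
    Laplace-smoothed action frequencies per state."""
    counts = np.full((n_states, n_actions), alpha, dtype=float)
    for traj in trajs:
        for s, a in traj:
            counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def traj_log_prob(traj, policy):
    """Sum of log pi(a|s) over the trajectory. Transition probabilities are
    the same under every policy, so they cancel when comparing clusters."""
    return float(sum(np.log(policy[s, a]) for s, a in traj))

def pg_kmeans(trajs, k, n_states, n_actions, n_iters=50, seed=0, init_labels=None):
    """Alternate between fitting one policy per cluster and reassigning
    each trajectory to its highest-likelihood policy."""
    if init_labels is None:
        rng = np.random.default_rng(seed)
        labels = rng.integers(k, size=len(trajs))
    else:
        labels = np.asarray(init_labels).copy()
    policies = []
    for _ in range(n_iters):
        # Policy step: fit a policy on the trajectories currently in each cluster.
        # An empty cluster falls back to the uniform policy via smoothing.
        policies = [
            fit_bc_policy([t for t, l in zip(trajs, labels) if l == c],
                          n_states, n_actions)
            for c in range(k)
        ]
        # Assignment step: pick the policy with the highest log-likelihood.
        new_labels = np.array(
            [int(np.argmax([traj_log_prob(t, p) for p in policies])) for t in trajs]
        )
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the policies will not change either
        labels = new_labels
    return labels, policies
```

A call such as `pg_kmeans(trajs, k=2, n_states=2, n_actions=2)` expects `trajs` to be a list of trajectories, each a list of `(state, action)` integer pairs. The iteration cap `n_iters` is a safety bound added for this sketch and is not taken from the paper.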
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper investigates an important challenge in offline RL: handling shifting trajectory distributions in the data.
- The experimental results on a set of tasks in D4RL and GridWorld show the superior performance of the proposed method in terms of normalized mutual information, compared to clustering baselines.
- The motivation is not very clear to me. I believe the investigated problem is important, but the discussions and claims in the paper are not closely linked to the investigated problem (multi-modality or heterogeneity of offline datasets). For example: 1) Multi-modality is indeed a challenge in RL, but I don’t fully agree with the authors when they simply discussed the policy distribution-shifting scenario when introducing multi-modality, that could usually be also considered as mixture types of inpu
The authors make an interesting connection between trajectory clustering and the colouring problem, and point out the inherent ambiguity of the formulation. Additionally, empirical analysis demonstrates that both proposed algorithms can cluster with high accuracy across a range of heterogeneous offline RL datasets.
Trajectory clustering by itself has very limited use in real-world settings, as it is challenging to gather high-quality data. Although the authors do provide several meaningful applications of trajectory clustering in lines 49-67, they do not validate their claims with empirical results, which raises some concerns about the significance of the work. Although Section D presents some empirical evidence where clustering improves the performance of the downstream algorithm, CQL and IQL are prett
This paper is well-written and easy to follow, and explores an interesting field, trajectory clustering in offline RL. The paper conducts various analyses to validate the proposed models in experiments and the appendix.
As the authors mentioned, the paper's completeness is limited by a lack of validation through experiments with large-scale datasets and by further theoretical analysis. Please see the questions.
Taxonomy
Topics: Time Series Analysis and Forecasting · Reinforcement Learning in Robotics · Anomaly Detection Techniques and Applications
Methods: VQ-VAE
