Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning

Anthony Kobanda; Rémy Portelas; Odalric-Ambrym Maillard; Ludovic Denoyer

arXiv:2412.14865·cs.LG·February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces HiSPO, a hierarchical policy framework for continual offline reinforcement learning that effectively adapts to new navigation tasks while retaining prior knowledge, demonstrating efficiency and scalability in diverse environments.

Contribution

The paper presents HiSPO, a novel hierarchical policy subspace method tailored for continual offline RL in navigation tasks, addressing forgetting and scalability issues.

Findings

01

HiSPO outperforms baseline methods in MuJoCo maze tasks.

02

It maintains high performance while using less memory.

03

It is also effective in complex video game navigation simulations.

Abstract

We consider a Continual Reinforcement Learning setup, where a learning agent must continuously adapt to new tasks while retaining previously acquired skill sets, with a focus on the challenge of avoiding forgetting past gathered knowledge and ensuring scalability with the growing number of tasks. Such issues prevail in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address these issues, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and…
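The phrase "policy subspaces" in the abstract can be pictured with a small sketch. It is illustrative only and not taken from the paper: it assumes a subspace is the convex hull of a few "anchor" parameter vectors, in the spirit of subspace-of-policies methods, and that a single policy is one point in that hull. Every name here (`sample_simplex_weights`, `combine_anchors`, `linear_policy`) and the toy linear policy are hypothetical.

```python
import numpy as np


def sample_simplex_weights(num_anchors, rng):
    """Draw convex-combination weights: non-negative and summing to 1."""
    return rng.dirichlet(np.ones(num_anchors))


def combine_anchors(anchors, weights):
    """Return the parameter vector located at `weights` inside the subspace."""
    anchors = np.asarray(anchors, dtype=float)  # shape (num_anchors, num_params)
    weights = np.asarray(weights, dtype=float)  # shape (num_anchors,)
    return weights @ anchors


def linear_policy(params, obs, action_dim):
    """Toy deterministic policy (an assumption): action = tanh(W @ obs)."""
    weight_matrix = params.reshape(action_dim, obs.shape[0])
    return np.tanh(weight_matrix @ obs)


rng = np.random.default_rng(0)
obs_dim, action_dim, num_anchors = 4, 2, 3

# Hypothetical anchors: each row is one full set of policy parameters.
anchors = rng.normal(size=(num_anchors, obs_dim * action_dim))

# Picking a policy means picking a point on the simplex over the anchors.
weights = sample_simplex_weights(num_anchors, rng)
params = combine_anchors(anchors, weights)
action = linear_policy(params, rng.normal(size=obs_dim), action_dim)
```

Under this assumption, storing a policy for a new task only requires storing its weight vector when the existing anchors suffice, which is one way such methods can keep memory growth small.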

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The paper is mostly clear. The main results and contributions are clearly highlighted, and the conclusions are succinctly stated. The proposed hierarchical subspace of policies is novel and empirically validated. It can capture different levels of knowledge, enabling knowledge reuse and memory savings.

Weaknesses

**Limited Contribution**: Most techniques proposed in the paper are already applied in related domains. The continual subspace of policies (CSP) was proposed as a continual reinforcement learning method, and adapting it to continual imitation learning is straightforward. Low-rank adaptation (LoRA) is also well established in continual learning, and its application here does not present any distinct innovation. The only novel aspect is the hierarchical subspaces of policies. **Limited Improvement

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Effective continual learning framework that slightly outperforms prior work in the tradeoff between performance and relative memory size.
- Construction of benchmarks with comprehensive baselines for evaluation: a set of navigation maze tasks that showcase the efficacy of HILOW and other algorithms on individual task streams.

Weaknesses

- FTN and SCN seem to be strong baselines, as seen in Figure 3, with a similar performance-to-memory-size tradeoff. Additionally, approaches such as Polytropon, multitask prompt tuning, and similar PEFT approaches have been found to help with multi-task adaptation; they may also help with memory efficiency and would be a good standard of comparison alongside these baselines.
- The domains seem somewhat narrow and similar to each other (all being maze-like), which meets the offline navigati

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The adaptation of an existing method such as "continual subspace of policies" to a slightly different flavor of reinforcement learning makes sense and should be studied. The new method resulting from this adaptation is explained clearly and well. With the given information, the experiments should be quite easily reproducible. The metrics and methods used to evaluate the baselines and the proposed algorithm are well chosen and make sense in the given context. With this in mind this study could be a small b

Weaknesses

The greatest weakness of the paper is the baseline comparisons. There is a preliminaries chapter that is referenced when describing what the proposed method is compared to, but it only describes categories of methodologies in a broad way. The actual description of the baselines, and why they were chosen instead of others, is too short or even missing, making them hard to comprehend. It is especially unclear what part of the preliminaries chapter is relevant for which baseline method, un

Taxonomy

Topics: Open Source Software Innovations