Continual Task Learning through Adaptive Policy Self-Composition
Shengchao Hu, Yuhang Zhou, Ziqing Fan, Jifeng Hu, Li Shen, Ya Zhang, Dacheng Tao

TL;DR
This paper introduces CompoFormer, a novel transformer-based approach for continual offline reinforcement learning that adaptively combines previous policies to improve learning stability and plasticity across multiple tasks.
Contribution
The paper presents CompoFormer, a structure-based transformer model that dynamically composes prior policies using a meta-policy network, addressing catastrophic forgetting in continual offline RL.
Findings
CompoFormer outperforms traditional continual learning (CL) methods on longer task sequences.
It effectively balances plasticity and stability in continual offline RL.
The approach leverages semantic correlations to selectively integrate prior knowledge.
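The idea of selectively integrating prior knowledge can be pictured with a small attention-style sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `compose_policies`, the use of scaled dot-product scores between task-description embeddings, and all numeric values are hypothetical. It only shows how prior policies whose task descriptions resemble the new task could receive larger mixing weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compose_policies(task_embedding, prior_task_embeddings, prior_policy_outputs):
    """Attention-weighted mix of prior policy outputs (hypothetical helper).

    task_embedding:        embedding of the new task's description (query)
    prior_task_embeddings: one embedding per previously learned task (keys)
    prior_policy_outputs:  one action vector per prior policy (values)
    """
    # Assumption: scaled dot-product similarity stands in for "semantic correlation".
    scale = math.sqrt(len(task_embedding))
    scores = [dot(task_embedding, key) / scale for key in prior_task_embeddings]
    weights = softmax(scores)

    # Weighted sum of the prior policies' action vectors, one dimension at a time.
    action_dim = len(prior_policy_outputs[0])
    composed = [
        sum(w * out[i] for w, out in zip(weights, prior_policy_outputs))
        for i in range(action_dim)
    ]
    return weights, composed


# Hypothetical toy values: the new task matches the first prior task's description.
new_task = [1.0, 0.0]
prior_tasks = [[1.0, 0.0], [0.0, 1.0]]
prior_outputs = [[0.5, -0.5], [-1.0, 1.0]]

weights, action = compose_policies(new_task, prior_tasks, prior_outputs)
print("weights:", weights)
print("composed action:", action)
```

In this toy setup the first prior policy gets the larger weight because its task embedding is closer to the new task's. The reviews also mention a step that adds new parameters when the composed policy does not solve the current task; that step is not sketched here.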
Abstract
Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a…
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
1. The paper is well written. 2. CompoFormer is capable of retaining knowledge across tasks, mitigating catastrophic forgetting effectively.
1. The reliance on textual descriptions for attention scores may not generalize well to domains lacking explicit task semantics. 2. It introduces extra training costs, especially for evaluating the ability of the composed policy to solve the current task, i.e., the performance comparison between policies.
Strengths:
- Sequential learning settings are interesting and important to study
- Method is a combination of some nice ideas including low-rank adaptation and leveraging task strings
- Lots of comparisons to prior works in the experiments
For the problem setting & benchmark 1. One of the assumptions seems unrealistic — The problem setting seems to assume that datasets from previous tasks cannot be accessed at all. It seems unrealistic that you wouldn’t be able to store any of it. For example, many large scale ML systems seem to be more limited by compute than hard drive space, and if there is so much data that it doesn’t fit on a hard drive, then it makes sense to remove some of the data from the current task to accommodate data
1. Novel method to combine policies from a fixed set of policies through an attention mechanism, as well as the way to optionally add new policies to this set by learning LoRA parameters for the base model (transformer). Overall, using an attention-based mechanism for combining policies is promising and novel. There is already evidence that it is a very effective way to combine knowledge (see, for example, the Flamingo paper (https://arxiv.org/abs/2204.14198) for how an attention-based mechanism is used to c
# Major weakness:
## Why Continual Offline Reinforcement Learning?
The paper lacks a clear discussion / motivation for the considered Continual Offline Reinforcement Learning setting. It mainly postulates that this setting is important or that successfully solving this is a natural requirement for long-lived agents. However, it is unclear why one should care about this setting. In particular, if one has access to a sequence of offline datasets, a natural thing to do would be to just combine
* The authors propose a sound and flexible continual RL method.
* Strong experimental results and analysis.
* Clarity of writing
My only concern is on the importance of the continual offline RL problem, instead of a more general multi-task offline RL problem. It seems like doing continual learning is just imposing an artificial constraint on what order the agent sees the data, one that the “Prune” version of this method is actually violating when it re-trains on old tasks. Meanwhile doing multi-task offline RL with the same data assumptions can result in more effective algorithms that do not have to deal with catastroph
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Knowledge Management and Sharing · Personal Information Management and User Behavior
