Continual Task Learning through Adaptive Policy Self-Composition

Shengchao Hu; Yuhang Zhou; Ziqing Fan; Jifeng Hu; Li Shen; Ya Zhang; Dacheng Tao

arXiv:2411.11364 · cs.LG · November 19, 2024




TL;DR

This paper introduces CompoFormer, a novel transformer-based approach for continual offline reinforcement learning that adaptively combines previous policies to improve learning stability and plasticity across multiple tasks.

Contribution

The paper presents CompoFormer, a structure-based transformer model that dynamically composes prior policies using a meta-policy network, addressing catastrophic forgetting in continual offline RL.

Findings

1. CompoFormer outperforms traditional CL methods in longer task sequences.
2. It effectively balances plasticity and stability in continual offline RL.
3. The approach leverages semantic correlations to selectively integrate prior knowledge.
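To make the idea of composing prior policies through semantic correlations more concrete, here is a minimal sketch of attention-weighted composition. It is not the authors' implementation: the embedding vectors, the action vectors, and the function name `compose_actions` are all hypothetical, and plain scaled dot-product attention stands in for whatever scoring CompoFormer actually uses. The sketch only shows the general pattern: a new task's description embedding is compared against embeddings of earlier tasks, and the outputs of the corresponding prior policies are blended with the resulting weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def compose_actions(query_emb, key_embs, prior_actions):
    """Blend prior policy outputs with scaled dot-product attention weights.

    query_emb:     embedding of the new task description (assumed given)
    key_embs:      one embedding per previously learned task (assumed given)
    prior_actions: one action vector per prior policy for the current state
    """
    scale = math.sqrt(len(query_emb))
    weights = softmax([dot(query_emb, k) / scale for k in key_embs])
    dim = len(prior_actions[0])
    composed = [
        sum(w * a[i] for w, a in zip(weights, prior_actions))
        for i in range(dim)
    ]
    return weights, composed


# Hypothetical data: the first prior task is semantically close to the new one.
new_task = [1.0, 0.0, 1.0, 0.0]
prior_tasks = [[1.0, 0.0, 0.9, 0.1], [0.0, 1.0, 0.0, 1.0]]
prior_actions = [[0.5, -0.2], [-0.4, 0.8]]

weights, action = compose_actions(new_task, prior_tasks, prior_actions)
print(weights, action)
```

In this toy setup the more similar prior task receives the larger weight, so its policy dominates the composed action. A dissimilar task contributes little, which is one way to read "selectively integrate prior knowledge".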

Abstract

Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The paper is well written.
2. CompoFormer is capable of retaining knowledge across tasks, mitigating catastrophic forgetting effectively.

Weaknesses

1. The reliance on textual descriptions for attention scores may not generalize well to domains lacking explicit task semantics.
2. It introduces extra training costs, especially for evaluating the ability of the composed policy to solve the current task, i.e., the performance comparison between policies.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- Sequential learning settings are interesting and important to study
- Method is a combination of some nice ideas including low-rank adaptation and leveraging task strings
- Lots of comparisons to prior works in the experiments
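The low-rank adaptation mentioned above can be illustrated generically. The following is a sketch of the standard LoRA update for a single linear layer, y = W x + (alpha / r) · B (A x), where W stays frozen and only the small matrices A and B are trained. It is not taken from the CompoFormer code: the matrix values, the dimensions, and the helper names `matvec` and `lora_forward` are assumptions chosen for illustration.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Generic LoRA forward pass for one linear layer.

    W: frozen base weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    Returns W x + (alpha / r) * B (A x).
    """
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


# Hypothetical values: a 2x3 frozen layer with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.1, 0.2, 0.3]]
B_zero = [[0.0], [0.0]]      # common initialisation: adapter has no effect
B_trained = [[1.0], [-2.0]]  # stand-in for values learned on a new task
x = [1.0, 2.0, 3.0]

y_base = lora_forward(W, A, B_zero, x)
y_adapted = lora_forward(W, A, B_trained, x)
print(y_base, y_adapted)
```

With B initialised to zero the layer reproduces the frozen base output exactly, so adding an adapter for a new task does not disturb behaviour already learned. Only the r × (d_in + d_out) adapter entries need to be stored per task.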

Weaknesses

For the problem setting & benchmark:
1. One of the assumptions seems unrealistic: the problem setting seems to assume that datasets from previous tasks cannot be accessed at all. It seems unrealistic that you wouldn’t be able to store any of it. For example, many large scale ML systems seem to be more limited by compute than hard drive space, and if there is so much data that it doesn’t fit on a hard drive, then it makes sense to remove some of the data from the current task to accommodate data

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. Novel method to combine policies from a fixed set of policies through an attention mechanism, as well as a way to optionally add new policies to this set by learning LoRA parameters for the base model (transformer). Overall, using an attention-based mechanism for combining policies is promising and novel. There is already evidence that it is a very effective way to combine knowledge (see, for example, the Flamingo paper (https://arxiv.org/abs/2204.14198) for how an attention-based mechanism is used to c

Weaknesses

# Major weakness:
## Why Continual Offline Reinforcement Learning?
The paper lacks a clear discussion or motivation for the considered Continual Offline Reinforcement Learning setting. It mainly postulates that this setting is important or that successfully solving it is a natural requirement for long-lived agents. However, it is unclear why one should care about this setting. In particular, if one has access to a sequence of offline datasets, a natural thing to do would be to just combine

Reviewer 04 · Rating 6 · Confidence 3

Strengths

* The authors propose a sound and flexible continual RL method.
* Strong experimental results and analysis.
* Clarity of writing

Weaknesses

My only concern is on the importance of the continual offline RL problem, instead of a more general multi-task offline RL problem. It seems like doing continual learning is just imposing an artificial constraint on what order the agent sees the data, one that the “Prune” version of this method is actually violating when it re-trains on old tasks. Meanwhile doing multi-task offline RL with the same data assumptions can result in more effective algorithms that do not have to deal with catastroph

Code & Models

Repositories

charleshsc/CompoFormer
PyTorch · Official


Taxonomy

Topics: Context-Aware Activity Recognition Systems · Knowledge Management and Sharing · Personal Information Management and User Behavior