QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing
Grace Zhang, Ayush Jain, Injune Hwang, Shao-Hua Sun, Joseph J. Lim

TL;DR
This paper introduces Q-switch mixture of policies (QMP), a framework for sharing behaviors across tasks in multi-task reinforcement learning to improve sample efficiency and performance.
Contribution
QMP provides a new, principled method for selectively sharing behaviors using Q-functions, enhancing existing multi-task RL algorithms.
Findings
QMP improves sample efficiency in various environments.
QMP outperforms alternative behavior sharing methods.
QMP offers complementary gains over popular MTRL algorithms.
Abstract
Multi-task reinforcement learning (MTRL) aims to learn several tasks simultaneously for better sample efficiency than learning them separately. Traditional methods achieve this by sharing parameters or relabeled data between tasks. In this work, we introduce a new framework for sharing behavioral policies across tasks, which can be used in addition to existing MTRL methods. The key idea is to improve each task's off-policy data collection by employing behaviors from other task policies. Selectively sharing helpful behaviors acquired in one task to collect training data for another task can lead to higher-quality trajectories, leading to more sample-efficient MTRL. Thus, we introduce a simple and principled framework called Q-switch mixture of policies (QMP) that selectively shares behavior between different task policies by using the task's Q-function to evaluate and select useful…
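The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that, when collecting data for one task, every task policy proposes an action for the current state and the current task's Q-function scores those proposals, with the highest-scoring one being executed. The names `q_switch_select`, `task_q`, and `policies`, and the toy 1-D setup, are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
Policy = Callable[[State], Action]
QFunction = Callable[[State, Action], float]


def q_switch_select(
    state: State, task_q: QFunction, policies: List[Policy]
) -> Tuple[int, Action]:
    """Pick the behavior to execute for the current task.

    Each policy (the task's own and those of other tasks) proposes one
    action; the current task's Q-function scores every proposal and the
    highest-scoring one is returned with the index of its policy.
    Ties resolve to the lowest policy index.
    """
    candidates = [policy(state) for policy in policies]
    scores = [task_q(state, action) for action in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]


# Toy 1-D example (hypothetical): the current task wants to reach x = 1.0.
GOAL = 1.0


def toy_task_q(state: State, action: Action) -> float:
    # Stand-in for a learned Q-function: negative distance to the goal
    # after applying the action.
    return -abs((state[0] + action[0]) - GOAL)


toy_policies: List[Policy] = [
    lambda s: (0.5,),   # the task's own, still imperfect, policy
    lambda s: (-0.5,),  # another task's policy, unhelpful here
    lambda s: (1.0,),   # another task's policy, helpful here
]

chosen_index, chosen_action = q_switch_select((0.0,), toy_task_q, toy_policies)
print(chosen_index, chosen_action)  # prints: 2 (1.0,)
```

In this sketch the selected action would only drive data collection for the current task. Each step calls every policy once and the Q-function once per proposal, which is the per-step cost the reviewers point to for large task sets.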
Peer Reviews
Decision: ICLR 2025 Poster
- The paper is well-written with clear organization and easy to follow.
- The method is elegantly simple yet theoretically sound, introducing behavior sharing through off-policy data collection rather than policy regularization, which avoids bias in the learning objective while maintaining convergence guarantees.
- The proposed Q-switch mechanism provides a principled approach to selective behavior sharing that is complementary to existing MTRL methods, demonstrating consistent improvements wh
1. The Q-switch mechanism requires evaluating all task policies and Q-values at each step. While parallelization helps, this still leads to significant computational costs, especially for large task sets as evidenced by the 7+ days runtime in MT50 experiments.
2. The method heavily relies on accurate Q-function estimation to select appropriate behaviors. This dependency may lead to suboptimal behavior selection during early training or in tasks with sparse rewards where Q-function learning is un
1. QMP is designed to complement existing MTRL frameworks, such as parameter sharing and data relabeling.
2. The authors provide theoretical analysis showing that QMP’s behavior-sharing mechanism preserves the convergence guarantees of the underlying reinforcement learning algorithm.
3. The paper presents extensive experiments across various multi-task environments (e.g., manipulation, navigation, and locomotion).
1. QMP requires evaluating Q-values across multiple task policies at each decision step, which could introduce computational overhead, particularly in settings with a large number of tasks. The paper lacks a detailed comparison of computational costs between QMP and baseline methods, which would be helpful for assessing its scalability.
1. The QMP framework introduces a Q-value-based behavior selection mechanism that enables selective behavior sharing in multi-task reinforcement learning (MTRL), enhancing sample efficiency.
2. The extensive experimental results across diverse domains (manipulation, navigation, and locomotion) demonstrate QMP’s performance compared to traditional behavior-sharing baselines, showcasing its practical impact in different tasks.
1. This paper only proposes a multi-task data sampling strategy implemented within an off-policy framework, which is limited in scope and shows limited improvement in more complex tasks, such as MT10 and MT50.
2. The other algorithms in paper [1] achieved better results with fewer interactions with the environment during training. I think that the method proposed in this paper makes a limited contribution to the field of multi-task reinforcement learning.
3. It is unclear how to ensure avoidanc
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Human Pose and Action Recognition
