APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

Finn Rietz; Pedro Zuidberg dos Martires; Johannes Andreas Stork

arXiv:2601.19452·cs.LG·January 28, 2026

Open Access · 3 Reviews

TL;DR

APC-RL introduces an adaptive hierarchical method that draws on multiple demonstration priors in reinforcement learning, speeding up learning when the demonstrations are aligned and staying robust when they are suboptimal or misaligned.

Contribution

The paper proposes an adaptive policy composition framework that selectively leverages multiple data-driven priors, refining or ignoring each one according to its relevance to the target task.

Findings

01. Accelerates learning with aligned demonstrations

02. Remains robust under severe misalignment

03. Leverages suboptimal demonstrations for exploration

Abstract

Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when they are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe…
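The abstract says APC estimates how applicable each prior is and composes them adaptively, but this excerpt does not show the actual selection rule. The following is only a minimal sketch of one plausible way to realize that idea: a hypothetical `select_actor` helper samples which actor to run from a softmax over per-actor value estimates. The function name, the `temperature` parameter, and the softmax rule itself are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def select_actor(q_estimates, temperature=1.0, rng=None):
    """Sample an actor index from a softmax over per-actor value estimates.

    Assumption: a higher estimate means the corresponding prior currently
    looks more applicable to the target task. This is an illustrative rule,
    not the selection mechanism from the paper.
    """
    rng = rng or np.random.default_rng()
    q = np.asarray(q_estimates, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    logits = (q - q.max()) / temperature
    weights = np.exp(logits)
    probs = weights / weights.sum()
    index = int(rng.choice(len(q), p=probs))
    return index, probs


# Hypothetical value estimates for three actors at the current state.
index, probs = select_actor([0.2, 1.5, 0.9], temperature=0.5)
print(index, probs)
```

Under this rule, a misaligned prior whose value estimate stays low is chosen less and less often, while a high temperature keeps all priors in play for exploration.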

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Leveraging suboptimal or misaligned demonstration data for reinforcement learning is a significant research topic, given that real-world scenarios are unlikely to offer perfectly aligned demonstration data. The proposed method is a novel contribution, in particular the reward-sharing trick, which exploits the invertibility of normalizing flows to enable continuous updates for all actors. The experimental section is executed thoroughly and with high quality; the ablations are convincing that both

Weaknesses

APC’s framework trains multiple parallel SAC agents and evaluates each, so, as is typical for methods that target the low-sample regime, sample efficiency improves while total compute and (likely) wall-clock time may actually increase. Clarifying the scale of this extra compute (and, e.g., the memory for the additional replay buffers) per task suite would be helpful. Besides Franka Kitchen, the evaluation settings are relatively simple (Maze Navigation and Car Racing); demonstrating this works

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The reward-sharing trick is particularly interesting, as it allows different actors to learn from a single data source, thereby improving sample efficiency.
2. The paper presents comprehensive experimental results across multiple environments, demonstrating the effectiveness of the proposed method.

Weaknesses

1. Since the proposed APC method learns multiple actors from the environment, it may incur additional computational costs compared to traditional methods (e.g., PARROT). Moreover, the approach requires multiple demonstration datasets, which may not be available in real-world scenarios.
2. The ablation study is also not sufficiently strong. For instance, it lacks an analysis of performance under limited demonstration sources (e.g., using only a single dataset) and does not include

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. Tackles a relevant problem: overcoming poor demonstrations in RL.
2. Simple and modular architecture that integrates easily with SAC.
3. Reward-sharing via NF inversion is conceptually neat and potentially sample-efficient.
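The reviews describe reward-sharing only at a high level: because the NF priors are invertible, one environment transition can be used to update every actor. The sketch below illustrates that idea under loud assumptions. Each actor is assumed to act in the latent space of its own prior, a toy state-conditioned affine map (`AffineFlow`) stands in for a learned normalizing flow, and `share_transition` is a hypothetical helper. None of these names or design details come from the paper.

```python
import numpy as np


class AffineFlow:
    """Toy invertible map a = weight @ s + scale * z.

    Stands in for a learned, state-conditioned normalizing-flow prior.
    A real NF would be a deeper invertible network (assumption).
    """

    def __init__(self, weight, scale):
        self.weight = np.asarray(weight, dtype=float)  # maps state to an action shift
        self.scale = float(scale)                      # must be nonzero to stay invertible

    def forward(self, z, s):
        # Latent action z -> environment action a.
        return self.weight @ s + self.scale * z

    def inverse(self, a, s):
        # Environment action a -> the latent z this prior would have needed.
        return (a - self.weight @ s) / self.scale


def share_transition(flows, buffers, s, a, r, s_next):
    """Relabel one executed transition into every actor's latent space.

    Whichever actor produced `a`, each prior is inverted so that every
    replay buffer receives a transition consistent with its own flow.
    """
    for flow, buffer in zip(flows, buffers):
        z = flow.inverse(a, s)
        buffer.append((s, z, r, s_next))


# Two hypothetical priors and one observed transition.
flows = [AffineFlow(np.eye(2), 0.5), AffineFlow(-np.eye(2), 2.0)]
buffers = [[], []]
state = np.array([1.0, -1.0])
action = np.array([0.3, 0.7])
share_transition(flows, buffers, state, action, r=1.0, s_next=state)
```

After the call, each buffer holds a different latent action that maps back to the same executed action through its own flow, which is what would let all actors learn from a single rollout.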

Weaknesses

1. Efficiency and justification gaps: The approach may be inefficient in multiple ways: maintaining several actors, evaluating each at every step, and managing multiple replay buffers. The paper neither analyzes this computational overhead nor justifies why keeping multiple actors is preferable to simply retraining a single policy once demonstrations start degrading performance.
2. Limited generality: The method is evaluated only with SAC, with no discussion of applicability to other off-policy

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning