APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition
Finn Rietz, Pedro Zuidberg dos Martires, Johannes Andreas Stork

TL;DR
APC-RL introduces an adaptive hierarchical method that effectively utilizes multiple demonstration priors in reinforcement learning, improving learning speed and robustness even with suboptimal or misaligned data.
Contribution
It proposes a novel adaptive policy composition framework that selectively leverages multiple data-driven priors, refining or ignoring them based on their relevance to the target task.
Findings
Accelerates learning with aligned demonstrations
Remains robust under severe misalignment
Leverages suboptimal demonstrations for exploration
Abstract
Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when these demonstrations are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors, or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe…
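The abstract describes a hierarchical model that estimates how applicable each prior is to the target task and sidesteps misaligned ones. The following is a minimal sketch of that selection idea only, not the paper's selector. It assumes a hypothetical softmax rule over running return estimates, a toy setting with fixed mean returns per actor, and a prior-free actor as the way to sidestep priors. All names (`SoftmaxSelector`, `run_demo`) are illustrative.

```python
import numpy as np


class SoftmaxSelector:
    """Hypothetical high-level selector over several low-level actors.

    Assumption: applicability is approximated by a running estimate of the
    return obtained when each actor is in control. The paper may estimate
    applicability differently.
    """

    def __init__(self, n_actors, temperature=1.0, step_size=0.1, seed=0):
        self.values = np.zeros(n_actors)  # running return estimate per actor
        self.temperature = temperature
        self.step_size = step_size
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        # Numerically stable softmax over the scaled value estimates.
        logits = self.values / self.temperature
        weights = np.exp(logits - logits.max())
        return weights / weights.sum()

    def select(self):
        # Sample which actor controls the next episode.
        return int(self.rng.choice(len(self.values), p=self.probabilities()))

    def update(self, actor_index, episode_return):
        # Move the chosen actor's estimate toward the observed return.
        error = episode_return - self.values[actor_index]
        self.values[actor_index] += self.step_size * error


def run_demo(n_rounds=500):
    # Toy stand-in for an environment: actor 0 uses an aligned prior,
    # actor 1 a misaligned prior, actor 2 is a prior-free actor (assumed).
    mean_returns = np.array([1.0, -1.0, 0.2])
    selector = SoftmaxSelector(n_actors=3, temperature=0.2)
    noise = np.random.default_rng(1)
    for _ in range(n_rounds):
        k = selector.select()
        selector.update(k, mean_returns[k] + 0.1 * noise.normal())
    return selector
```

In this toy setting, calling `run_demo().probabilities()` shows the selector concentrating on the actor with the aligned prior while the misaligned one is rarely chosen. The temperature controls how quickly unhelpful priors are abandoned.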
Peer Reviews
Decision: ICLR 2026 Poster
Leveraging suboptimal or misaligned demonstration data for reinforcement learning is a significant research topic, given that real-world scenarios are unlikely to have perfectly aligned demonstration data. The proposed method is a novel contribution, in particular the reward-sharing trick, which exploits the invertibility of normalizing flows to enable continuous updates for all actors. The experimental section is thorough and of high quality; the ablations are convincing that both…
APC's framework trains multiple parallel SAC agents and evaluates each, so, as is typical for methods that target the low-sample regime, sample efficiency improves while total compute and (likely) wall-clock time may increase. Clarifying the scale of this extra compute per task suite (and, e.g., the memory for the additional replay buffers) would be helpful. Besides Franka Kitchen, the evaluation settings are relatively simple (Maze Navigation and Car Racing); demonstrating this works…
1. The reward-sharing trick is particularly interesting, as it allows different actors to learn from a single data source, thereby improving sample efficiency. 2. The paper presents comprehensive experimental results across multiple environments, demonstrating the effectiveness of the proposed method.
1. Since the proposed APC method learns multiple actors from the environment, it may incur additional computational costs compared to traditional methods (e.g., PARROT). Moreover, the approach requires multiple demonstration datasets, which may not be applicable in real-world scenarios. 2. The ablation study in the paper is also not sufficiently strong. For instance, it lacks an analysis of performance under limited demonstration sources (e.g., using only a single dataset) and does not include…
1. Tackles a relevant problem: overcoming poor demonstrations in RL. 2. Simple and modular architecture that integrates easily with SAC. 3. Reward-sharing via NF inversion is conceptually neat and potentially sample-efficient.
1. Efficiency and justification gaps: The approach may be inefficient in multiple ways: maintaining several actors, evaluating each at every step, and managing multiple replay buffers. The paper neither analyzes this computational overhead nor justifies why keeping multiple actors is preferable to simply retraining a single policy once demonstrations start degrading performance. 2. Limited generality: The method is evaluated only with SAC, with no discussion of applicability to other off-policy…
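Several reviews highlight reward sharing via normalizing-flow inversion. A rough sketch of how such a trick could work is given here, assuming (as in PARROT-style methods) that each actor acts in the latent space of its own flow prior. The `AffineFlowPrior` class is a toy stand-in for a learned state-conditioned flow, and `share_transition` and the list-based buffers are hypothetical names, not the paper's implementation.

```python
import numpy as np


class AffineFlowPrior:
    """Toy invertible prior: action = shift + scale * latent (element-wise).

    Assumption: this replaces a learned, state-conditioned normalizing flow
    purely for illustration.
    """

    def __init__(self, shift, scale):
        self.shift = np.asarray(shift, dtype=float)
        self.scale = np.asarray(scale, dtype=float)
        if np.any(self.scale == 0):
            raise ValueError("scale must be non-zero for the map to be invertible")

    def forward(self, latent):
        # Map an actor's latent output to an environment action.
        return self.shift + self.scale * np.asarray(latent, dtype=float)

    def inverse(self, action):
        # Recover the latent that would have produced this action.
        return (np.asarray(action, dtype=float) - self.shift) / self.scale


def share_transition(priors, buffers, state, action, reward, next_state):
    """Store one environment transition in every actor's replay buffer.

    Each prior inverts the executed action into its own latent space, so all
    actors can learn from a step that only one of them actually took.
    """
    for prior, buffer in zip(priors, buffers):
        latent = prior.inverse(action)
        buffer.append((state, latent, reward, next_state))


# Usage: actor 0 acts, and both actors receive a usable transition.
priors = [
    AffineFlowPrior([0.0, 0.0], [1.0, 1.0]),
    AffineFlowPrior([0.5, -0.5], [2.0, 0.5]),
]
buffers = [[] for _ in priors]
state, next_state = np.zeros(3), np.ones(3)
executed_action = priors[0].forward([0.3, -0.2])
share_transition(priors, buffers, state, executed_action,
                 reward=1.0, next_state=next_state)
```

The sketch also makes the reviewers' overhead concern concrete: every environment step triggers one inversion and one buffer write per actor, so memory and compute grow with the number of priors.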
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
