PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore
Zhihao Lin, Lin Wu, Zhen Tian, and Jianglin Lan

TL;DR
PrefPoE introduces an advantage-guided preference fusion framework that stabilizes exploration in reinforcement learning, significantly improving sample efficiency and performance across various control tasks.
Contribution
It is the first to apply product-of-experts fusion for advantage-guided exploration, creating a soft trust region that enhances policy updates and exploration stability.
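For a discrete action space, the product-of-experts idea can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `poe_fuse_discrete` and the weighting parameter `beta` are assumptions introduced here, and the summary does not say how PrefPoE weights the two experts. The sketch relies only on the standard fact that multiplying two categorical distributions and renormalizing is the same as adding their logits and applying a softmax.

```python
import numpy as np

def poe_fuse_discrete(policy_logits, pref_logits, beta=1.0):
    """Fuse two categorical distributions as a product of experts.

    p_fused(a) is proportional to p_policy(a) * p_pref(a) ** beta.
    `beta` is a hypothetical knob for this sketch; beta = 0 recovers the policy.
    """
    policy_logits = np.asarray(policy_logits, dtype=float)
    pref_logits = np.asarray(pref_logits, dtype=float)
    # A product of probabilities is a sum of logits.
    fused_logits = policy_logits + beta * pref_logits
    # Subtract the max before exponentiating for numerical stability.
    fused_logits = fused_logits - fused_logits.max()
    probs = np.exp(fused_logits)
    return probs / probs.sum()

# A uniform policy fused with a preference that favors action 0.
policy_logits = np.zeros(3)
pref_logits = np.log(np.array([0.5, 0.25, 0.25]))
fused = poe_fuse_discrete(policy_logits, pref_logits)
```

Because an action needs non-negligible probability under both experts to keep mass after fusion, the fused distribution stays close to the current policy while leaning toward actions the preference expert favors. That is one plausible reading of the "soft trust region" described above.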
Findings
+321% on HalfCheetah-v4
+69% on Ant-v4
+276% on LunarLander-v2
Abstract
Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on…
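For continuous actions, the fusion step described in the abstract can be illustrated with diagonal Gaussians, where the product of two Gaussian densities is again Gaussian with a precision-weighted mean. The sketch below assumes both the policy and the preference network output a mean and a standard deviation per action dimension. That parameterization and the name `poe_fuse_gaussian` are assumptions made for illustration and are not confirmed by this summary.

```python
import numpy as np

def poe_fuse_gaussian(mu_policy, sigma_policy, mu_pref, sigma_pref):
    """Product of two diagonal Gaussians (standard closed form).

    Returns the mean and standard deviation of the renormalized product
    N(mu_policy, sigma_policy^2) * N(mu_pref, sigma_pref^2).
    """
    prec_policy = 1.0 / np.square(sigma_policy)  # precision = 1 / variance
    prec_pref = 1.0 / np.square(sigma_pref)
    prec_fused = prec_policy + prec_pref         # precisions add
    mu_fused = (prec_policy * mu_policy + prec_pref * mu_pref) / prec_fused
    sigma_fused = np.sqrt(1.0 / prec_fused)
    return mu_fused, sigma_fused

# Hypothetical two-dimensional action: the preference expert is more
# confident (smaller sigma) in the second dimension.
mu, sigma = poe_fuse_gaussian(
    mu_policy=np.array([0.0, 0.0]), sigma_policy=np.array([1.0, 1.0]),
    mu_pref=np.array([2.0, 2.0]), sigma_pref=np.array([1.0, 0.5]),
)
```

With equal standard deviations the fused mean lands halfway between the two experts. When one expert is more confident, the fused mean moves toward it. The fused standard deviation is never larger than either input, so sampling from the fused distribution narrows exploration around the region both experts agree on.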
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Exploration remains a central challenge in RL, and the idea of guiding exploration based on advantage estimates is both intuitive and meaningful. 2. The paper provides solid algorithmic detail and theoretical justification, making the contribution more credible. 3. Experiments across multiple environments show strong performance gains and stability improvements. 4. The authors analyze the contribution of core components, which helps clarify the source of performance improvement.
1. The tested environments are mostly standard locomotion tasks (HalfCheetah, Ant, LunarLander) with relatively simple reward landscapes. PrefPoE may perform well there, but it remains unclear how it handles more complex or multimodal tasks (e.g., Humanoid or HumanoidBench tasks [1]) where local optima and sparse rewards dominate. 2. The paper focuses mainly on PPO-based variants. Other exploration-focused methods (e.g., RND [2], ICM [3], or parameter noise approaches) are not compared. This ma
1. The performance of PrefPoE is very strong compared to vanilla PPO on the chosen environments. 2. The motivation is very clear.
1. The motivation and background of the problem are clearly stated; however, the related work section does not include papers that are directly related to this problem: * **Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning (2021), by Whitney et al.**: This paper is the one most directly related to PrefPoE and uses the same "fused distribution" to sample actions. The only difference is that they use a curiosity-based exploration policy for exploration instea
The authors provide some theoretical understanding regarding the nature of the preference policy, that focuses exploration on the regions that are more promising according to the current estimates of the advantage function. Nice intuitions about how the exploration-exploitation works are provided.
Although a comparison with PPO is provided, a comparison with the state-of-the-art SAC is not. For instance, in Tables 1-2 of the reference https://openreview.net/forum?id=HhbHw2yInZ the authors can see that the SAC performance is higher than the one reported in the current article. Also, apparently, the PPO result for Ant is better than the one reported by the authors. I understand that there could be some implementation differences, but then it is unclear how robust the results are to th
1. **Sound technical approach**: The use of the PoE mechanism to reshape action distributions is technically sound. Incorporating an auxiliary preference head that considers advantage estimates provides an interpretable way to bias action selection toward potentially beneficial states.
1. **Weak Theoretical Motivation** The central claim that *exploration should be guided by advantage estimates rather than being uniformly distributed* lacks rigorous justification. This weakens the paper's foundational motivation. Moreover, the characterization of standard trajectory rollout as "uniformly sampling" is inaccurate: standard policy gradient methods already bias sampling toward actions with higher probabilities under the learned policy. Value information is implicitly incorpora
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
