Learning from negative feedback, or positive feedback or both
Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, Jonas Buchli, Martin Riedmiller

TL;DR
This paper introduces a novel EM-based method that enables learning from positive, negative, or both types of feedback, expanding preference optimization to scenarios with unpaired feedback and demonstrating stable learning from negative feedback alone.
Contribution
It extends EM algorithms to incorporate negative feedback explicitly, allowing effective learning even with only one feedback type, which was not addressed by prior methods.
Findings
Effective learning from negative feedback alone demonstrated
Method outperforms existing approaches in unpaired feedback scenarios
Applicable to language models and decision-making policies
Abstract
Existing preference optimization methods often assume scenarios where paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback--for example, either positive or negative--is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well-addressed by current methods. Our approach builds upon the probabilistic framework introduced in (Dayan and Hinton, 1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes…
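As a rough illustration of the decoupling described in the abstract, the sketch below scores a small categorical policy with separate terms for positive and negative samples. It is a toy under stated assumptions, not the authors' implementation: the weighted form (`alpha` on the positive log-likelihood, `1 - alpha` on the negative log-likelihood, `beta` on a KL penalty toward a reference policy), the direction of the KL, and every name in the code are hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def decoupled_objective(logits, ref_logits, pos_actions, neg_actions,
                        alpha=0.5, beta=1.0):
    """Assumed objective (to be maximized) for a categorical policy.

    alpha       : weight on positive feedback (1 - alpha goes to negative).
    beta        : strength of the KL penalty toward the reference policy.
    pos_actions : indices of actions labeled positive (may be empty).
    neg_actions : indices of actions labeled negative (may be empty).
    """
    log_pi = log_softmax(np.asarray(logits, dtype=float))
    log_ref = log_softmax(np.asarray(ref_logits, dtype=float))
    pos = np.asarray(pos_actions, dtype=int)
    neg = np.asarray(neg_actions, dtype=int)

    # Each feedback type contributes its own term; a missing type adds 0.
    pos_term = log_pi[pos].mean() if pos.size else 0.0
    neg_term = log_pi[neg].mean() if neg.size else 0.0

    # KL(ref || pi): assumed direction, keeps the policy near the reference.
    kl = np.sum(np.exp(log_ref) * (log_ref - log_pi))

    return alpha * pos_term - (1.0 - alpha) * neg_term - beta * kl


# Negative-only feedback: alpha = 0 and no positive samples.
ref = np.zeros(4)
before = decoupled_objective(np.zeros(4), ref, [], [3], alpha=0.0, beta=1.0)
after = decoupled_objective(np.array([0.0, 0.0, 0.0, -1.0]), ref, [], [3],
                            alpha=0.0, beta=1.0)
print(before, after)
```

In this sketch, the negative term alone could be increased without bound by pushing the probability of a negative sample toward zero; the KL penalty is what keeps the negative-only case bounded here.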
Peer Reviews
Decision: ICLR 2025 Spotlight
1. The proposed PMPO and its derivation are, to my knowledge, novel. The objective is also easy to understand and implement, and the derivation has a clear probabilistic grounding in expectation maximization.
2. The paper is clearly written and tackles a topic relevant to the ICLR community.
While the method derivation is clear and well-motivated, the primary weakness in the work lies in the experiments.
1. The proposed method does not outperform DPO, the main baseline being compared to.
2. The experiments on bandit RL tasks focus on DPO as a baseline, without considering other methods used in these benchmarks.
3. The DPO baseline does not seem to use all the data given to PMPO; for instance, the end of Section 5.1 states that DPO uses "the best and worst action samples among the 4
- Tackling the relevant and complex problem of incomplete data in preference optimization, for example, only having access to negative examples
- Thorough and extensive related work making the contribution clear
- Objective is intuitive and makes sense probabilistically, especially through the use of the prior
- More flexible than methods like DPO and might apply to novel scenarios
- Extensive empirical evaluation on a variety of tasks, from control and RL to LLM preference optimization
- Does introduce new hyperparameters that are potentially non-trivial to tune ($\alpha, \beta$)
- Title could be more specific. For example, something mentioning the capability to learn from dis-preferred examples. This could also help to attract readers interested in this particular problem. Currently, it seems only appealing to researchers interested in probabilistic inference.
- Does not improve over DPO, but might also be due to missing datasets well-suited for the setup
The proposed method allows for training with unpaired examples and accommodates scenarios where only one type of feedback—positive or negative—is available, making it widely applicable across different contexts.
* Personally, I find the presentation of this paper lacking. The main formulation in equation (10) is quite intuitive and provides a straightforward extension of the previous pairwise method to a more general setting. However, the derivations in Sections 3.1 and 3.2 are tedious and difficult to follow. I question the necessity of such extensive derivation from the expectation-maximization (EM) framework. It seems possible that the authors formulated the equation first and then sought a probabili
Taxonomy
Topics: Semantic Web and Ontologies
