Learning from negative feedback, or positive feedback or both

Abbas Abdolmaleki; Bilal Piot; Bobak Shahriari; Jost Tobias Springenberg; Tim Hertweck; Rishabh Joshi; Junhyuk Oh; Michael Bloesch; Thomas Lampe; Nicolas Heess; Jonas Buchli; Martin Riedmiller

arXiv:2410.04166·cs.LG·March 10, 2025

Open Access 3 Reviews

TL;DR

This paper introduces a novel EM-based method that enables learning from positive, negative, or both types of feedback, expanding preference optimization to scenarios with unpaired feedback and demonstrating stable learning from negative feedback alone.

Contribution

The paper extends EM algorithms to incorporate negative feedback explicitly, so that learning remains effective when only one feedback type is available, a setting that prior methods did not address well.

Findings

1. Effective learning from negative feedback alone demonstrated
2. Method outperforms existing approaches in unpaired feedback scenarios
3. Applicable to language models and decision-making policies

Abstract

Existing preference optimization methods often assume scenarios where paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback--for example, either positive or negative--is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well-addressed by current methods. Our approach builds upon the probabilistic framework introduced in (Dayan and Hinton, 1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes…
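The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes a categorical policy over a handful of discrete actions and an objective of the form: a weighted log-likelihood of positive samples, minus a weighted log-likelihood of negative samples, minus a KL penalty toward a reference policy. The function names, the exact weighting by `alpha` and `beta`, and the direction of the KL term are illustrative assumptions, not the paper's equation (10).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mean_log_prob(log_probs, actions):
    """Average log-probability of the given actions (0.0 if there are none)."""
    if not actions:
        return 0.0
    return sum(log_probs[a] for a in actions) / len(actions)


def decoupled_objective(logits, ref_logits, pos_actions, neg_actions,
                        alpha=0.5, beta=1.0):
    """Hypothetical decoupled objective (an assumption, not the paper's exact form).

    alpha weights the positive term, (1 - alpha) the negative term, and
    beta the KL(reference || policy) penalty that keeps the policy near the prior.
    Either pos_actions or neg_actions may be empty (unpaired feedback).
    """
    log_p = log_softmax(logits)
    log_ref = log_softmax(ref_logits)
    pos_term = mean_log_prob(log_p, pos_actions)
    neg_term = mean_log_prob(log_p, neg_actions)
    # KL(ref || policy) = sum_a ref(a) * (log ref(a) - log policy(a))
    kl = sum(math.exp(lr) * (lr - lp) for lr, lp in zip(log_ref, log_p))
    return alpha * pos_term - (1.0 - alpha) * neg_term - beta * kl


# Negative-only feedback: action 1 was marked as bad, nothing was marked as good.
uniform = [0.0, 0.0, 0.0, 0.0]
avoid_bad = [0.0, -2.0, 0.0, 0.0]  # policy that lowers the logit of action 1
print(decoupled_objective(uniform, uniform, [], [1], alpha=0.0, beta=0.1))
print(decoupled_objective(avoid_bad, uniform, [], [1], alpha=0.0, beta=0.1))
```

In this sketch the KL term is what keeps the negative-only case bounded: without it, the objective could be increased indefinitely by pushing the probability of the negative sample toward zero.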

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The proposed PMPO and its derivation are, to my knowledge, novel. The objective is also easy to understand and implement, and the derivation has a clear probabilistic grounding in expectation maximization.
2. The paper is clearly written and tackles a topic relevant to the ICLR community.

Weaknesses

While the method derivation is clear and well-motivated, the primary weakness in the work lies in the experiments.
1. The proposed method does not outperform DPO, the main baseline being compared to.
2. The experiments on bandit RL tasks focus on DPO as a baseline, without considering other methods used in these benchmarks.
3. The DPO baseline does not seem to use all the data given to PMPO; for instance, the end of Section 5.1 states that DPO uses "the best and worst action samples among the 4

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- Tackles the relevant and complex problem of incomplete data in preference optimization, for example, only having access to negative examples
- Thorough and extensive related work, making the contribution clear
- Objective is intuitive and makes sense probabilistically, especially through the use of the prior
- More flexible than methods like DPO and might apply to novel scenarios
- Extensive empirical evaluation on a variety of tasks, from control and RL to LLM preference optimization

Weaknesses

- Introduces new hyperparameters that are potentially non-trivial to tune ($\alpha, \beta$)
- Title could be more specific, for example by mentioning the capability to learn from dis-preferred examples. This could also help to attract readers interested in this particular problem. Currently, it seems appealing only to researchers interested in probabilistic inference.
- Does not improve over DPO, though this might be due to a lack of datasets well-suited to the setup

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The proposed method allows for training with unpaired examples and accommodates scenarios where only one type of feedback—positive or negative—is available, making it widely applicable across different contexts.

Weaknesses

* Personally, I find the presentation of this paper lacking. The main formulation in equation (10) is quite intuitive and provides a straightforward extension of the previous pairwise method to a more general setting. However, the derivations in Sections 3.1 and 3.2 are tedious and difficult to follow. I question the necessity of such extensive derivation from the expectation-maximization (EM) framework. It seems possible that the authors formulated the equation first and then sought a probabili

Taxonomy

Topics: Semantic Web and Ontologies