Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Julia Kreutzer; Joshua Uyheng; Stefan Riezler

arXiv:1805.10627 · cs.CL · December 14, 2018

TL;DR

This paper investigates the reliability of human bandit feedback for sequence-to-sequence reinforcement learning and shows that more reliable feedback leads to better reward estimates and higher translation quality, which suggests potential for applications at larger scale.

Contribution

The study analyzes human feedback reliability and its impact on reward learning in sequence-to-sequence RL, highlighting the effectiveness of standardized cardinal feedback.
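To make "reward learning" concrete, here is a minimal sketch of a regression-based reward estimator trained with mean squared error. The linear model, the feature vectors, and the names `train_reward_estimator` and `estimate_reward` are illustrative assumptions and not the authors' architecture; this page does not describe how the paper's estimator is built.

```python
def train_reward_estimator(features, ratings, lr=0.1, epochs=500):
    """Fit r_hat(x) = w . x + b by batch gradient descent on mean squared error.

    features: list of equal-length feature vectors (hypothetical translation features)
    ratings:  list of human ratings, one per feature vector
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, ratings):
            # Prediction error for this rated translation.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def estimate_reward(w, b, x):
    """Predicted rating for an unseen translation with features x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Once fitted, `estimate_reward` could stand in for direct human ratings on translations that no annotator has seen, which is the role a reward estimator plays in the RL setup described here.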

Findings

01

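Standardized cardinal feedback (5-point ratings) is the most reliable form of feedback studied, and cardinal feedback is the easiest to learn and generalize from.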

02

A regression-based reward estimator trained on cardinal feedback for 800 translations improves translation quality by over 1 BLEU when integrated into RL for NMT.

03

RL can be effective with small, reliable human feedback datasets.
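The findings above refer to standardized cardinal feedback. One common way to standardize ratings is to z-score them per annotator, so that raters who use the scale differently become comparable. The sketch below assumes that interpretation; the function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_rater(ratings):
    """Z-score each rating against the mean and std of the rater who gave it.

    ratings: list of (rater_id, item_id, score) tuples.
    Returns a list of (rater_id, item_id, z_score) tuples in the same order.
    """
    by_rater = defaultdict(list)
    for rater, _item, score in ratings:
        by_rater[rater].append(score)

    # Population statistics per rater.
    stats = {r: (mean(s), pstdev(s)) for r, s in by_rater.items()}

    standardized = []
    for rater, item, score in ratings:
        mu, sigma = stats[rater]
        # A rater who always gives the same score carries no signal: map to 0.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        standardized.append((rater, item, z))
    return standardized
```

After this step, a rating of 4 from a generous rater and a 4 from a strict rater no longer count as the same signal.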

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator $\alpha$-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even…
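The abstract reports $\alpha$-agreement without defining it. Assuming it refers to Krippendorff's $\alpha$, a minimal sketch of the interval-scale variant follows: $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement within items and $D_e$ is the disagreement expected by chance. The function name and input layout are hypothetical, and the choice of the interval metric is an assumption.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the squared-difference (interval) metric.

    units: list of lists; each inner list holds the ratings one item received.
    Items with fewer than two ratings are ignored because they are not pairable.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences between ratings of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairable ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - d_o / d_e
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values indicate systematic disagreement.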

Code & Models

Repositories

juliakreutzer/bandit-neuralmonkey (TensorFlow)
