DORB: Dynamically Optimizing Multiple Rewards with Bandits

Ramakanth Pasunuru; Han Guo; Mohit Bansal

arXiv:2011.07635·cs.CL·November 17, 2020

TL;DR

This paper introduces DORB, a method that uses multi-armed bandits to dynamically optimize multiple reward metrics in language generation tasks, improving quality and diversity.

Contribution

It proposes a novel automated approach for balancing multiple reward metrics in reinforcement learning using bandits, with two specific algorithms and empirical validation.

Findings

01. Effective in improving automatic metrics

02. Enhances human evaluation scores

03. Demonstrates adaptability on unseen data

Abstract

Policy-gradient-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics for language generation tasks. However, optimizing for a specific metric reward mostly improves that metric alone, suggesting that the model is gaming the formulation of that metric in a particular way, often without achieving real qualitative improvements. Hence, it is more beneficial to make the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via…
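The abstract is cut off before it names a concrete algorithm, so the following is only a minimal sketch of the general idea of letting a multi-armed bandit pick which metric reward to optimize next. It assumes a generic Exp3 bandit, which may differ from the paper's actual algorithms. The metric names (`metric_a`, `metric_b`, `metric_c`), the `usefulness` table, and the simulated feedback are all hypothetical stand-ins; in a real training loop the chosen metric would presumably serve as the RL reward for the next batches, and the bandit's feedback would come from some held-out performance signal.

```python
import math
import random


class Exp3:
    """Generic Exp3 bandit; each arm stands for one candidate metric reward."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs)[0]

    def update(self, arm, reward):
        # `reward` is the bandit's feedback and must lie in [0, 1].
        probs = self.probabilities()
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale by the largest weight to avoid floating-point overflow.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


# Hypothetical metric names, one per arm.
metrics = ["metric_a", "metric_b", "metric_c"]

# Simulated chance that optimizing each metric helps (an assumption made
# purely for this toy demo; a real system would measure this signal).
usefulness = [0.2, 0.8, 0.4]

bandit = Exp3(n_arms=len(metrics), gamma=0.1, seed=0)
feedback_rng = random.Random(1)

for _ in range(5000):
    arm = bandit.select()
    # Placeholder for "train with metrics[arm] as the reward, then evaluate".
    reward = 1.0 if feedback_rng.random() < usefulness[arm] else 0.0
    bandit.update(arm, reward)

final_probs = bandit.probabilities()
for name, p in zip(metrics, final_probs):
    print(f"{name}: {p:.3f}")
```

Because the selection probabilities keep a uniform floor of `gamma / n_arms`, no metric is ever dropped entirely, which is one simple way to let the preferred reward change over the course of training.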

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics