DORB: Dynamically Optimizing Multiple Rewards with Bandits
Ramakanth Pasunuru, Han Guo, Mohit Bansal

TL;DR
This paper introduces DORB, a method that uses multi-armed bandits to dynamically optimize multiple reward metrics in language generation tasks, improving quality and diversity.
Contribution
It proposes a novel automated approach for balancing multiple reward metrics in reinforcement learning using bandits, with two specific algorithms and empirical validation.
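To make the idea concrete, here is a minimal sketch of a bandit that picks which reward metric to optimize at each round. It assumes an Exp3-style adversarial bandit, which is an illustrative choice and not something this summary confirms. The class `Exp3RewardSelector`, the metric names, and the `toy_gain` function are all hypothetical; in a real setup the gain would come from some measured improvement of the model after a training step on the chosen metric.

```python
import math
import random


class Exp3RewardSelector:
    """Exp3-style bandit whose arms are reward metrics (hypothetical helper)."""

    def __init__(self, metric_names, gamma=0.1, seed=0):
        self.metric_names = list(metric_names)
        self.gamma = gamma  # exploration rate in (0, 1]
        self.weights = [1.0] * len(self.metric_names)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def choose(self):
        # Sample one arm (metric index) and return it with its probability.
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, probs[arm]

    def update(self, arm, prob, gain):
        # `gain` is assumed to lie in [0, 1]; dividing by `prob` gives an
        # importance-weighted estimate of the arm's gain.
        estimate = gain / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.weights))
        # Rescale so the weights stay bounded; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def toy_gain(arm):
    # Stand-in for "how much did training on this metric help?".
    # Assumption for the demo: the first metric is the most useful one.
    return 0.9 if arm == 0 else 0.1


selector = Exp3RewardSelector(["metric_a", "metric_b", "metric_c"])
for _ in range(500):
    arm, prob = selector.choose()
    # A real training loop would run a policy-gradient step here using
    # selector.metric_names[arm] as the reward, then measure the gain.
    selector.update(arm, prob, toy_gain(arm))

final_probs = selector.probabilities()
best_metric = selector.metric_names[final_probs.index(max(final_probs))]
```

In this toy run the selector shifts most of its probability toward the metric with the larger gain while still exploring the others at a rate set by `gamma`. Because the gain signal can change over training, the same loop yields a schedule of metrics that varies over time without hand-tuned weights.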
Findings
Improves automatic evaluation metrics
Improves human evaluation scores
Remains effective on unseen data
Abstract
Policy gradients-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics for language generation tasks. However, optimizing for a specific metric reward leads to improvements in mostly that metric only, suggesting that the model is gaming the formulation of that metric in a particular way without often achieving real qualitative improvements. Hence, it is more beneficial to make the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
