Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Swaroop Nath; Tejpalsingh Siledar; Sankara Sri Raghava Ravindra Muddu; Rupasai Rangaraju; Harshad Khadilkar; Pushpak Bhattacharyya; Suman Banerjee; Amey Patil; Sudhanshu Shekhar Singh; Muthusamy Chelliah; Nikesh Garera

arXiv:2402.15473·cs.CL·April 19, 2024·1 cites

Open Access · 1 Repo · 2 Datasets

TL;DR

This paper introduces a domain knowledge-infused reward modeling approach for RLHF that significantly reduces human preference annotation needs, demonstrated in e-commerce opinion summarization with state-of-the-art results.

Contribution

The paper presents a novel reward modeling method that incorporates domain knowledge to reduce annotation effort, and introduces two new datasets for opinion summarization.

Findings

01 · 21× reduction in required preference annotation

02 · ~4 point ROUGE-L improvement over SOTA

03 · Preferred by humans over SOTA 68% of the time

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a dominating strategy in aligning Language Models (LMs) with human values/goals. The key to the strategy is learning a reward model ($φ$), which can reflect the latent reward model of humans. While this strategy has proven effective, the training methodology requires a lot of human preference annotation (usually in the order of tens of thousands) to train $φ$. Such a large-scale annotation is justifiable when it's a one-time effort, and the reward model is universally applicable. However, human goals are subjective and depend on the task, requiring task-specific preference annotations, which can be impractical to fulfill. To address this challenge, we propose a novel approach to infuse domain knowledge into $φ$, which reduces the amount of preference annotation required (by $21\times$), omits Alignment Tax, and…
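The idea of learning a reward model $φ$ from pairwise preferences can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear reward over two hypothetical, hand-crafted feature scores (named `aspect_coverage` and `fluency` here purely for illustration) and fits it with the standard Bradley-Terry preference loss commonly used in RLHF reward modelling. The toy data is made up.

```python
import math

def reward(weights, features):
    """Linear reward phi(x) = w . f(x) over hand-crafted feature scores (an assumption)."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(phi(preferred) - phi(rejected))."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    # log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_loss(weights, pairs):
    return sum(preference_loss(weights, p, r) for p, r in pairs) / len(pairs)

def train(pairs, n_features, lr=0.5, epochs=200):
    """Fit the weights by plain gradient descent on the mean preference loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d/dw of -log sigmoid(margin) = -(1 - sigmoid(margin)) * (f_pref - f_rej)
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                grad[i] += coeff * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights

# Hypothetical toy preferences: each summary is scored as
# [aspect_coverage, fluency]; the first item of each pair is the preferred one.
pairs = [
    ([0.9, 0.5], [0.2, 0.6]),
    ([0.8, 0.4], [0.3, 0.4]),
    ([0.7, 0.9], [0.1, 0.8]),
]
initial_loss = mean_loss([0.0, 0.0], pairs)
weights = train(pairs, n_features=2)
final_loss = mean_loss(weights, pairs)
```

Because the reward in this sketch only has to learn a handful of weights over interpretable features, rather than the parameters of a full language model, it hints at why injecting domain knowledge could lower the number of preference pairs needed; the actual features and model used in the paper may differ.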

Code & Models

Repositories

swaroop-nath/reward-approx-social-choice-opp-summ · PyTorch · Official

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Recommender Systems and Techniques