LongReward: Improving Long-context Large Language Models with AI Feedback

Jiajie Zhang; Zhongni Hou; Xin Lv; Shulin Cao; Zhenyu Hou; Yilin Niu; Lei Hou; Yuxiao Dong; Ling Feng; Juanzi Li

arXiv:2410.21252 · cs.CL · October 29, 2024

TL;DR

This paper introduces LongReward, a method that uses an off-the-shelf LLM to provide multi-dimensional rewards for long-context responses, improving the performance of large language models in long-context tasks through offline RL.

Contribution

The paper proposes LongReward, a novel reward mechanism for long-context LLMs using an off-the-shelf LLM and a new assessment pipeline, combined with offline RL to enhance long-context capabilities.
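
As a rough illustration of how multi-dimensional rewards could feed offline RL, the sketch below collapses per-dimension scores into one scalar and selects the highest- and lowest-scoring sampled responses as a preference pair. The dimension names come from the abstract; the plain averaging rule, the score scale, and the names `overall_reward` and `build_preference_pair` are assumptions made for illustration, not the paper's confirmed implementation. The judge LLM that would produce the scores is not modeled here; scores are passed in as plain dictionaries.

```python
from statistics import mean

# The four dimensions named in the abstract.
DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


def overall_reward(scores: dict) -> float:
    """Collapse per-dimension scores into one scalar.

    Assumption: a plain average over the four dimensions.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def build_preference_pair(prompt: str, candidates: list):
    """Pick the best and worst candidate as (chosen, rejected).

    `candidates` is a list of (response_text, scores_dict) tuples.
    Returns None when the best and worst candidates tie, because no
    preference can be inferred from equal rewards.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates, key=lambda c: overall_reward(c[1]))
    worst, best = ranked[0], ranked[-1]
    if overall_reward(best[1]) == overall_reward(worst[1]):
        return None
    return {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}
```

The resulting dictionaries use the common prompt/chosen/rejected layout for preference data, which is also an assumption here rather than a documented format of the released dataset.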

Findings

01. LongReward significantly improves long-context model performance.

02. It enhances models' ability to follow short instructions.

03. Long-context and short-context DPO can be combined without performance loss.

Abstract

Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments…
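
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, computed from summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the abstract does not state the hyperparameters or training code used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each *_logp argument is the total log-probability a model assigns to
    one full response given the prompt.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.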

Code & Models

Repositories

THUDM/LongReward
PyTorch · Official

Models

Datasets

zai-org/LongReward-10k
dataset · 1.2k downloads

Videos

LongReward: Improving Long-context Large Language Models with AI Feedback · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Shrink and Fine-Tune · Direct Preference Optimization