LongReward: Improving Long-context Large Language Models with AI Feedback
Jiajie Zhang, Zhongni Hou, Xin Lv, Shulin Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li

TL;DR
This paper introduces LongReward, a method that uses an off-the-shelf LLM to provide multi-dimensional rewards for long-context responses, improving the performance of large language models in long-context tasks through offline RL.
Contribution
The paper proposes LongReward, a novel reward mechanism for long-context LLMs using an off-the-shelf LLM and a new assessment pipeline, combined with offline RL to enhance long-context capabilities.
Findings
LongReward significantly improves long-context model performance.
It enhances models' ability to follow short instructions.
Long-context and short-context DPO can be combined without performance loss.
Abstract
Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward and offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments…
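The pipeline described in the abstract (score each response on four dimensions, then train with DPO) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `Candidate` class, the 0-10 score scale, the plain average in `overall_reward`, and the best-versus-worst pairing in `build_preference_pair` are all assumptions made for this example; the summary above does not specify how the four scores are combined or how preference pairs are chosen. The `dpo_loss` function is the standard per-pair DPO objective, with `beta=0.1` as an arbitrary illustrative value.

```python
import math
from dataclasses import dataclass

# The four dimensions named in the abstract.
DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


@dataclass
class Candidate:
    """Hypothetical container for one sampled response and its judge scores."""
    text: str
    scores: dict  # dimension name -> score from the judge LLM (assumed 0-10 scale)


def overall_reward(candidate: Candidate) -> float:
    # Assumption: combine the four dimension scores with an unweighted mean.
    return sum(candidate.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def build_preference_pair(candidates: list) -> tuple:
    # Assumption: the highest-reward response is "chosen", the lowest "rejected".
    ranked = sorted(candidates, key=overall_reward)
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference (SFT) model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    samples = [
        Candidate("answer A", dict(zip(DIMENSIONS, (8, 7, 9, 6)))),
        Candidate("answer B", dict(zip(DIMENSIONS, (4, 5, 3, 4)))),
        Candidate("answer C", dict(zip(DIMENSIONS, (6, 6, 6, 6)))),
    ]
    chosen, rejected = build_preference_pair(samples)
    print(chosen.text, rejected.text)
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In practice the log-probabilities would come from the policy and reference models evaluated on the full long-context prompt plus response; the scalar version here only shows how the reward scores feed into the preference objective.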
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Shrink and Fine-Tune · Direct Preference Optimization
