Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

Yingda Chen; Xingjun Wang; Jintao Huang; Yunlin Mao; Daoze Zhang; Yuze Zhao

arXiv:2410.10210·cs.CL·October 16, 2024

TL;DR

This paper shows that carefully curated, high-quality data can substantially improve a large language model's ability to generate long outputs with minimal tuning and compute, and that the approach applies across multiple models.

Contribution

The paper introduces a data-centric tuning approach that improves the long-output capabilities of LLMs using limited data and compute, and that can be applied to multiple models.

Findings

01. High-quality data improves long-output generation across models.

02. Minimal data and compute can achieve significant performance gains.

03. The curated dataset and tuning methods are publicly available.

Abstract

As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause of this imbalance may be the lack of long-output data during alignment training. In light of this observation, attempts have been made to re-align foundation models with data that fills the gap, resulting in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so starting from human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models with only a small fraction of the training data instances and compute. In addition, we assess the generalizability of…

Code & Models

Datasets

lenML/longwriter-6k-filtered
dataset · 52 downloads

Taxonomy

Topics: Reservoir Engineering and Simulation Methods