L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Chenxin An; Shansan Gong; Ming Zhong; Xingjian Zhao; Mukai Li; Jun Zhang; Lingpeng Kong; Xipeng Qiu

arXiv:2307.11088·cs.CL·October 5, 2023·6 cites

TL;DR

This paper introduces L-Eval, a standardized evaluation suite for long context language models, highlighting the limitations of existing metrics and proposing improved evaluation methods including LLM judges.

Contribution

The paper builds an evaluation benchmark for long context language models (LCLMs) and investigates the effectiveness of various metrics, advocating length-instruction-enhanced (LIE) evaluation and LLM judges.

Findings

01. Popular n-gram metrics do not align well with human judgment.

02. LIE evaluation and LLM judges provide more reliable assessments.

03. Empirical analysis of 16 models reveals insights into LCLM performance.
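
The first two findings can be illustrated with a small sketch. It is not the paper's evaluation code: `unigram_f1` is a generic token-overlap F1 used here as a stand-in for n-gram metrics, and `add_length_instruction` is a hypothetical helper whose prompt wording is an assumption. The point is only that an overlap metric scores a correct but verbose answer well below a concise one, which is the kind of length bias a length instruction aims to reduce.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (generic stand-in metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def add_length_instruction(question: str, reference: str) -> str:
    """Hypothetical helper: append a target length taken from the reference answer.

    The exact wording is an assumption made for illustration only.
    """
    n_words = len(reference.split())
    return f"{question} Answer in about {n_words} words."


reference = "the treaty was signed in 1648"
concise = "the treaty was signed in 1648"
verbose = (
    "according to the document the treaty was signed in 1648 "
    "after long negotiations between the parties"
)

concise_score = unigram_f1(concise, reference)
verbose_score = unigram_f1(verbose, reference)
prompt = add_length_instruction("When was the treaty signed?", reference)
```

With these inputs the concise answer scores 1.0 while the verbose answer scores about 0.55, even though both contain the correct date. Telling the model the expected answer length, as `prompt` does, is one way to keep such length differences from dominating the score.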

Abstract

Recently, there has been growing interest in extending the context length of large language models (LLMs), aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k $\sim$ 200k tokens). On the other hand, we investigate the effectiveness in evaluation…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Attention Is All You Need · Layer Normalization · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections · Dropout · Byte Pair Encoding · Residual Connection