L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu

TL;DR
This paper introduces L-Eval, a standardized evaluation suite for long context language models, highlighting the limitations of existing metrics and proposing improved evaluation methods including LLM judges.
Contribution
The paper develops a comprehensive evaluation benchmark for long context language models (LCLMs) and investigates how well various metrics work, advocating length-instruction-enhanced (LIE) evaluation and LLM judges.
Findings
Popular n-gram metrics do not align well with human judgment.
Length-instruction-enhanced (LIE) evaluation and LLM judges provide more reliable assessments.
Empirical analysis of 16 models reveals insights into LCLM performance.
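The first two findings can be illustrated with a small sketch. It is not the paper's code: `token_f1` is a generic unigram-overlap F1 standing in for "popular n-gram metrics", and `add_length_instruction` is a hypothetical helper whose prompt wording is an assumption about how a length instruction might be phrased. The example strings are made up for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def add_length_instruction(question: str, reference: str) -> str:
    """Append a word-count hint taken from the reference answer.

    Hypothetical wording; the actual instruction text is an assumption.
    """
    n_words = len(reference.split())
    return f"{question} Answer in about {n_words} words."


# Made-up example: both answers are factually identical.
reference = "the treaty was signed in 1848"
concise = "the treaty was signed in 1848"
verbose = (
    "based on the document provided above it is clear that "
    "the treaty was signed in 1848"
)

concise_score = token_f1(concise, reference)
verbose_score = token_f1(verbose, reference)
prompt = add_length_instruction("When was the treaty signed?", reference)
```

Under these assumptions, the verbose answer scores lower than the concise one even though both contain the same fact, because the extra words reduce precision. Telling the model the expected answer length is one way to reduce that length mismatch before an overlap metric is applied.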
Abstract
Recently, there has been growing interest in extending the context length of large language models (LLMs), aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input length (3k to 200k tokens). On the other hand, we investigate the effectiveness in evaluation…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · Layer Normalization · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections · Dropout · Byte Pair Encoding · Residual Connection
