S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

Fangyu Lei; Qian Liu; Yiming Huang; Shizhu He; Jun Zhao; Kang Liu

arXiv:2310.15147 · cs.CL · April 9, 2024 · 1 cites

TL;DR

S3Eval is a synthetic evaluation suite for assessing large language models systematically and at scale across long contexts and diverse tasks. It targets the difficulty of evaluating models whose context windows exceed what humans can reliably assess.

Contribution

The paper introduces S3Eval, a synthetic evaluation framework that enables controlled, scalable, and systematic probing of LLM capabilities, and reports a strong correlation between its results and real-world benchmarks.

Findings

01. S3Eval results correlate well with performance on real-world benchmarks.

02. The S3Eval-Standard dataset is significantly challenging for existing LLMs.

03. S3Eval supports flexible generation of long-context data at arbitrary length.

Abstract

The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like long-context understanding and reasoning. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 200K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. The synthetic nature of S3Eval gives users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval and real-world benchmarks demonstrates the…
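The idea of controlling text length and task difficulty in a synthetic dataset can be illustrated with a toy generator. This is a minimal sketch under stated assumptions, not S3Eval's actual task format or API: the function `make_example`, its parameters `n_rows` and `n_conditions`, the column names, and the row-counting task are all hypothetical and chosen only for illustration.

```python
import random


def make_example(n_rows, n_conditions, seed=0):
    """Build one toy synthetic example: a random table, a question, and its answer.

    Hypothetical illustration only; not the S3Eval implementation.
    """
    rng = random.Random(seed)  # seeded so the same arguments reproduce the same example
    columns = ["a", "b", "c"]

    # More rows produce a longer context.
    rows = [{c: rng.randint(0, 9) for c in columns} for _ in range(n_rows)]

    # More conditions produce a harder question.
    conditions = [(rng.choice(columns), rng.randint(0, 9)) for _ in range(n_conditions)]

    # The gold answer is computed by the program, so no human labeling is needed.
    answer = sum(all(row[c] >= t for c, t in conditions) for row in rows)

    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(row[c]) for c in columns) for row in rows)
    clause = " and ".join(f"{c} >= {t}" for c, t in conditions)
    question = f"How many rows satisfy {clause}?" if clause else "How many rows are there?"
    return {"context": header + "\n" + body, "question": question, "answer": answer}


if __name__ == "__main__":
    short = make_example(n_rows=5, n_conditions=1)
    long = make_example(n_rows=500, n_conditions=3)
    print(short["question"], "->", short["answer"])
    print(len(short["context"]), "vs", len(long["context"]), "characters of context")
```

In this sketch, raising `n_rows` lengthens the context while raising `n_conditions` makes the question harder, and the two knobs can be varied independently. Because the answer is derived from the generated table, a model's output can be scored automatically at any length.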

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques