S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models
Fangyu Lei, Qian Liu, Yiming Huang, Shizhu He, Jun Zhao, Kang Liu

TL;DR
S3Eval is a synthetic evaluation suite for systematically and scalably assessing large language models across long contexts and diverse tasks. It targets the problem that LLMs can now process far more text than humans can reliably assess.
Contribution
The paper introduces S3Eval, a novel synthetic evaluation framework that enables controlled, scalable, and systematic probing of LLMs' capabilities, with demonstrated correlation to real-world benchmarks.
Findings
S3Eval correlates well with real-world benchmark performance.
The S3Eval-Standard dataset poses a significant challenge to existing LLMs.
S3Eval supports flexible generation of long-context data at arbitrary lengths.
Abstract
The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like long-context understanding and reasoning. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 200K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. The synthetic nature of S3Eval gives users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval and real-world benchmarks demonstrates the…
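As a rough illustration of what "scaling text length and varying task difficulty" can look like for a synthetic task, the sketch below builds a random table, asks a counting question over it, and computes the gold answer programmatically. This is not S3Eval's actual generator or task format: the function `make_example`, its parameters (`num_rows`, `num_conditions`, `seed`), and the prompt layout are all hypothetical and chosen only for illustration.

```python
import random


def make_example(num_rows, num_conditions, seed=0):
    """Build one synthetic table-question example.

    Hypothetical format for illustration only, not S3Eval's own.
    num_rows controls context length; num_conditions (1 to 3)
    controls how many filters the question combines.
    """
    rng = random.Random(seed)  # seeded so examples are reproducible
    value_cols = ["a", "b", "c"]
    columns = ["id"] + value_cols

    # Length knob: more rows means a longer prompt.
    rows = []
    for i in range(num_rows):
        row = {"id": i}
        for col in value_cols:
            row[col] = rng.randint(0, 9)
        rows.append(row)

    # Difficulty knob: more conditions means more filtering steps.
    cond_cols = rng.sample(value_cols, k=num_conditions)
    conditions = [(col, rng.randint(0, 9)) for col in cond_cols]

    # The gold answer is computed by the program, so no human labeling is needed.
    answer = sum(
        1 for r in rows if all(r[col] >= v for col, v in conditions)
    )

    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(r[c]) for c in columns) for r in rows)
    clause = " and ".join(f"{col} >= {v}" for col, v in conditions)
    question = f"How many rows satisfy {clause}?"
    prompt = f"{header}\n{body}\n\n{question}"
    return prompt, answer


if __name__ == "__main__":
    short_prompt, short_answer = make_example(num_rows=5, num_conditions=1)
    long_prompt, long_answer = make_example(num_rows=500, num_conditions=3)
    print(short_prompt)
    print("gold answer:", short_answer)
    print("long prompt length (chars):", len(long_prompt))
```

Because the table and the answer both come from the same seeded program, the context can be made as long as needed and every example remains automatically checkable, which is the property the abstract attributes to synthetic evaluation.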
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
