Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling

Cong Xu, Gayathri Saranathan, Mahammad Parwez Alam, Arpit Shah, James Lim, Soon Yee Wong, Foltin Martin, Suparna Bhattacharya

arXiv:2406.15527·cs.LG·June 25, 2024

Open Access

TL;DR

SubLIME introduces an adaptive sampling framework that significantly reduces the computational cost of evaluating large language and text-to-image models while maintaining accurate model rankings and performance insights.

Contribution

The paper presents SubLIME, a novel adaptive sampling method that efficiently evaluates models across diverse benchmarks with minimal data, ensuring reliable rankings and reducing evaluation costs.

Findings

1. Quality-based sampling achieves high Pearson correlation (0.85 to 0.95) with full-dataset results at a 10% sampling rate.

2. Clustering methods perform well on specific benchmarks such as MMLU.

3. A 1% sampling rate is effective for certain benchmarks such as MMLU.

Abstract

Evaluating LLMs and text-to-image models is a computationally intensive and often overlooked task. Efficient evaluation is crucial for understanding the diverse capabilities of these models and for enabling comparisons across a growing number of new models and benchmarks. To address this, we introduce SubLIME, a data-efficient evaluation framework that employs adaptive sampling techniques, such as clustering and quality-based methods, to create representative subsets of benchmarks. Our approach yields model rankings that are statistically aligned with those from the full datasets, as evidenced by high Pearson correlation coefficients. Empirical analysis across six NLP benchmarks reveals that: (1) quality-based sampling methods, such as Quality SE and Quality CPD, consistently achieve strong correlations (0.85 to 0.95) with full datasets at a 10% sampling rate; (2) clustering methods excel in specific benchmarks such as…
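The agreement check described in the abstract (Pearson correlation between model scores on a subset and on the full benchmark) can be illustrated with a minimal sketch. This is not the SubLIME implementation: uniform random sampling stands in for the paper's clustering and quality-based samplers, the per-item correctness matrix `results` is a hypothetical data layout, and the data below is synthetic and generated only for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def subset_correlation(results, sample_rate, seed=0):
    """Correlate per-model accuracy on a sampled subset with full-benchmark accuracy.

    results[m][i] is 1 if model m answered item i correctly, else 0
    (a hypothetical layout, not taken from the paper).
    """
    n_items = len(results[0])
    k = max(1, round(sample_rate * n_items))
    # Assumption: uniform random sampling as a stand-in for an adaptive sampler.
    subset = random.Random(seed).sample(range(n_items), k)
    full_scores = [sum(row) / n_items for row in results]
    subset_scores = [sum(row[i] for i in subset) / k for row in results]
    return pearson(full_scores, subset_scores)


# Synthetic illustration only: six hypothetical models with different
# per-item success probabilities on a 2000-item benchmark.
rng = random.Random(42)
skills = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
results = [[1 if rng.random() < p else 0 for _ in range(2000)] for p in skills]

r_10 = subset_correlation(results, 0.10)   # 10% of the items
r_100 = subset_correlation(results, 1.0)   # sanity check: the full benchmark
print(f"Pearson r at 10% sampling: {r_10:.3f}")
```

A value of `r_10` close to 1 would indicate that the subset ranks the models much as the full benchmark does. An adaptive sampler would replace the `sample` call with a selection rule informed by the data.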

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer