Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

Hao Peng; Qingqing Cao; Jesse Dodge; Matthew E. Peters; Jared Fernandez; Tom Sherborne; Kyle Lo; Sam Skjonsberg; Emma Strubell; Darrell Plessas; Iz Beltagy; Evan Pete Walsh; Noah A. Smith; Hannaneh Hajishirzi

arXiv:2307.09701 · cs.CL · July 20, 2023 · 1 citation

TL;DR

Pentathlon is a comprehensive benchmark platform designed to evaluate NLP model efficiency across multiple metrics in a controlled, real-world scenario, promoting fair comparisons and environmental awareness.

Contribution

It introduces a standardized, holistic evaluation framework with a controlled hardware setup and diverse metrics, addressing current challenges in model efficiency assessment.

Findings

01 Provides a unified platform for efficiency evaluation

02 Includes metrics like latency, throughput, energy, and memory

03 Facilitates fair, reproducible comparisons across models
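
To make the metrics listed above concrete, here is a minimal sketch of how latency, throughput, and peak memory might be measured for a single model on one machine. It is not Pentathlon's implementation: the function name `measure_efficiency`, the `predict` callable, and the result keys are all hypothetical, and energy is left out because it requires hardware power readings that cannot be portably sampled from plain Python.

```python
import time
import tracemalloc
from statistics import mean
from typing import Callable, Dict, List, Sequence


def measure_efficiency(
    predict: Callable[[Sequence[str]], List[str]],
    batches: Sequence[Sequence[str]],
) -> Dict[str, float]:
    """Time a hypothetical `predict` callable over a list of input batches.

    Assumptions (not from the paper): `predict` takes one batch of strings
    and returns one output per input; memory is approximated by the peak
    Python heap reported by tracemalloc, which ignores GPU and native memory.
    """
    if not batches:
        raise ValueError("need at least one batch")

    latencies: List[float] = []
    n_instances = 0

    # tracemalloc adds overhead, so a careful setup would measure memory
    # in a separate run from the timing run.
    tracemalloc.start()
    wall_start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        predict(batch)
        latencies.append(time.perf_counter() - t0)
        n_instances += len(batch)
    wall = time.perf_counter() - wall_start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        # Average seconds spent per batch.
        "mean_latency_s": mean(latencies),
        # Instances processed per second of wall-clock time.
        "throughput_per_s": n_instances / wall if wall > 0 else float("inf"),
        # Peak Python heap usage in bytes during the run.
        "peak_python_heap_bytes": float(peak),
    }


if __name__ == "__main__":
    # Stand-in "model" that upper-cases its inputs.
    def toy_predict(batch: Sequence[str]) -> List[str]:
        return [text.upper() for text in batch]

    print(measure_efficiency(toy_predict, [["a", "b"], ["c"]]))
```

Numbers from a script like this are only comparable when every model runs on the same hardware under the same load, which is the gap a controlled evaluation platform is meant to close.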

Abstract

Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly controlled hardware platform, and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that…

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification

Methods: Lib · fail