Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation
Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi

TL;DR
Pentathlon is a comprehensive benchmark platform designed to evaluate NLP model efficiency across multiple metrics in a controlled, real-world scenario, promoting fair comparisons and environmental awareness.
Contribution
It introduces a standardized, holistic evaluation framework with a controlled hardware setup and diverse metrics, addressing current challenges in model efficiency assessment.
Findings
Provides a unified platform for efficiency evaluation
Includes metrics like latency, throughput, energy, and memory
Facilitates fair, reproducible comparisons across models
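To make two of the metrics listed above concrete, here is a minimal sketch of how wall-clock latency and throughput could be measured for any batch-prediction function. It is not Pentathlon's own harness: the names `measure_latency_and_throughput`, `toy_model`, and the `warmup` parameter are hypothetical and introduced only for illustration. Energy and memory are left out because they need hardware-level instrumentation.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_and_throughput(
    predict: Callable[[Sequence[str]], object],
    batches: Sequence[Sequence[str]],
    warmup: int = 1,
) -> dict:
    """Time `predict` on each batch; summarize latency and throughput."""
    # Untimed warm-up calls, so one-off setup costs do not skew the numbers.
    for batch in batches[:warmup]:
        predict(batch)

    latencies = []
    n_instances = 0
    for batch in batches:
        start = time.perf_counter()
        predict(batch)
        latencies.append(time.perf_counter() - start)
        n_instances += len(batch)

    total = sum(latencies)
    return {
        # Latency: seconds spent per batch.
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        # Throughput: instances processed per second of timed work.
        "throughput_per_s": n_instances / total if total > 0 else float("inf"),
        "instances": n_instances,
    }


def toy_model(batch):
    # Hypothetical stand-in for a real model: counts whitespace-separated tokens.
    return [len(text.split()) for text in batch]


batches = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]
report = measure_latency_and_throughput(toy_model, batches)
print(report)
```

Numbers obtained this way are only comparable across models when the hardware and software environment are held fixed, which is the gap a controlled platform is meant to close.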
Abstract
Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly controlled hardware platform and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that…
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification
