The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

TL;DR
The paper introduces GEA, a human evaluation arena for LLMs that incorporates energy consumption data, and finds that users tend to prefer smaller, more energy-efficient models when they are made aware of energy consumption.
Contribution
GEA is a novel evaluation framework that integrates energy consumption information into human assessments of LLMs, addressing scalability and energy-awareness in model evaluation.
Findings
Users prefer smaller, energy-efficient models when energy consumption is highlighted.
Energy-aware evaluations influence model ranking towards less energy-intensive options.
Most users do not perceive additional value in larger models when energy costs are considered.
Abstract
Evaluating large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues: the number of models to evaluate is large and growing, which makes it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another option is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An…
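As a rough illustration of how pairwise arena votes might be turned into a model ranking, the sketch below ranks models by their share of won comparisons, counting a tie as half a win for each side. This is an assumption made for illustration only: the abstract does not say which aggregation GEA or LM arena uses, and the function name `rank_by_win_rate`, the vote format, and the model names are all hypothetical.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by win rate over pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    model_a, model_b, or None for a tie (hypothetical format).
    Returns a list of (model, win_rate) sorted from best to worst.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            # A tie counts as half a win for each model.
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    rates = {model: wins[model] / games[model] for model in games}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up votes, not data from the paper.
votes = [
    ("small", "large", "large"),
    ("small", "large", "small"),
    ("small", "large", "small"),
    ("small", "medium", None),
]
ranking = rank_by_win_rate(votes)
for model, rate in ranking:
    print(f"{model}: {rate:.3f}")
```

A plain win rate ignores the strength of the opponents each model faced, so public arenas typically rely on rating schemes that account for it; the simpler form is used here only to keep the idea visible.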
Taxonomy
Topics: Topic Modeling
