The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Carlos Arriaga; Gonzalo Martínez; Eneko Sendin; Javier Conde; Pedro Reviriego

arXiv:2507.13302·cs.AI·July 18, 2025

Open Access

TL;DR

The paper introduces GEA, a human evaluation arena for LLMs that shows users the energy consumption of each model. Its results indicate that users tend to prefer smaller, more energy-efficient models when they are made aware of energy use.

Contribution

GEA is a novel evaluation framework that integrates energy consumption information into human assessments of LLMs, addressing scalability and energy-awareness in model evaluation.

Findings

01

Users prefer smaller, energy-efficient models when energy consumption is highlighted.

02

Energy-aware evaluations influence model ranking towards less energy-intensive options.

03

Most users do not perceive additional value in larger models when energy costs are considered.

Abstract

The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgment. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues: the large and growing number of models to evaluate makes it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another alternative is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then compiled into a model ranking. An…
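As a rough illustration of how pairwise arena votes might be turned into a ranking, the sketch below ranks models by win rate, counting a tie as half a win for each side. This is an assumption made for illustration only: the abstract does not say which aggregation method GEA or LM arena uses, and the function name `rank_by_win_rate`, the model names, and the votes are all hypothetical.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by win rate from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    model_a, model_b, or None for a tie. (Hypothetical format.)
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in votes:
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            # A tie counts as half a win for each model.
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    rates = {model: wins[model] / games[model] for model in games}
    # Highest win rate first; break ties alphabetically for determinism.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Made-up votes, not data from the paper.
votes = [
    ("small-model", "large-model", "small-model"),
    ("small-model", "large-model", "large-model"),
    ("small-model", "large-model", "small-model"),
    ("small-model", "large-model", None),
]
ranking = rank_by_win_rate(votes)
for model, rate in ranking:
    print(f"{model}: {rate:.3f}")
```

Win rate is the simplest possible aggregate; rating schemes that account for opponent strength are a common alternative when models do not all face each other equally often.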

Taxonomy

Topics: Topic Modeling