Evaluating LLM Metrics Through Real-World Capabilities
Justin K Miller, Wenjia Tang

TL;DR
This paper evaluates large language models based on real-world user tasks, identifying gaps in existing benchmarks and highlighting Google Gemini's superior performance in practical utility metrics.
Contribution
It introduces a human-centered evaluation framework for LLMs focusing on real-world capabilities and assesses how current benchmarks align with actual user needs.
Findings
Existing benchmarks have significant gaps in coverage and interpretability.
Google Gemini outperforms other models on real-world utility metrics.
Most benchmarks do not adequately measure practical usefulness.
Abstract
As generative AI becomes increasingly embedded in everyday workflows, it is important to evaluate its performance in ways that reflect real-world usage rather than abstract notions of intelligence. Unlike many existing benchmarks that assess general intelligence, our approach focuses on real-world utility, evaluating how well models support users in everyday tasks. While current benchmarks emphasize code generation or factual recall, users rely on AI for a much broader range of activities, from writing assistance and summarization to citation formatting and stylistic feedback. In this paper, we analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs): Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval. We then assess the extent to which…
Taxonomy
Methods: Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · Linear Layer · Weight Decay · Adam · Multi-Head Attention
