Enterprise Large Language Model Evaluation Benchmark

Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, and Kang Li

arXiv:2506.20274 · cs.AI · June 26, 2025

TL;DR

This paper introduces an enterprise-focused LLM evaluation benchmark grounded in Bloom's Taxonomy: a multi-task framework with 9,700 samples, curated through a scalable pipeline, that assesses model reasoning and judgment capabilities.

Contribution

It presents a novel 14-task evaluation framework tailored for enterprise applications, along with a scalable data curation pipeline using LLMs for labeling and judging, resulting in a large, robust benchmark.

Findings

1. Open-source models like DeepSeek R1 perform comparably to proprietary models on reasoning tasks.

2. The same open-source models lag on judgment-based tasks, likely due to overthinking.

3. The benchmark exposes critical enterprise-specific performance gaps.

Abstract

Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model…
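
The abstract names three pipeline components (LLM-as-a-Labeler, LLM-as-a-Judge, and CRAG) without saying how they are wired together. The sketch below shows one plausible arrangement and is not the authors' implementation: a labeler proposes a label, a judge accepts or rejects it, and on rejection a retriever supplies corrective context for one more labeling attempt. The function `curate`, its `max_retries` parameter, and the toy stand-ins for the three components are all hypothetical; in practice each callable would wrap a model or retrieval call.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper).
Labeler = Callable[[str, Optional[str]], str]  # (text, corrective context) -> label
Judge = Callable[[str, str], bool]             # (text, label) -> accept?
Retriever = Callable[[str], str]               # text -> corrective context


def curate(
    samples: Iterable[str],
    labeler: Labeler,
    judge: Judge,
    retriever: Retriever,
    max_retries: int = 1,
) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs the judge accepts; drop the rest."""
    kept: List[Tuple[str, str]] = []
    for text in samples:
        context: Optional[str] = None
        for _ in range(max_retries + 1):
            label = labeler(text, context)
            if judge(text, label):
                kept.append((text, label))
                break
            # Judge rejected the label: fetch corrective context and retry.
            context = retriever(text)
    return kept


# Toy stand-ins so the sketch runs without any model or retrieval backend.
GOLD = {"good service": "positive", "great product": "positive", "unclear": "neutral"}


def toy_labeler(text: str, context: Optional[str]) -> str:
    if context is not None:
        return context  # trust the retrieved hint on a retry
    return "positive" if "good" in text else "negative"


def toy_judge(text: str, label: str) -> bool:
    return GOLD[text] == label


def toy_retriever(text: str) -> str:
    return "positive" if "great" in text else "unknown"


curated = curate(GOLD.keys(), toy_labeler, toy_judge, toy_retriever)
print(curated)
```

In this toy run, the first sample is accepted immediately, the second is accepted only after the corrective retry, and the third is dropped because the judge rejects both attempts. Dropping rejected samples is a design choice made here for brevity; a real pipeline might instead route them to human review.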

Taxonomy

Topics: Business Process Modeling and Analysis · Robotic Process Automation Applications · Collaboration in agile enterprises