BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Eunsu Kim; Haneul Yoo; Guijin Son; Hitesh Patel; Amit Agarwal; Alice Oh

arXiv:2506.00482·cs.LG·June 3, 2025

TL;DR

BenchHub is a comprehensive, dynamic benchmark platform that aggregates and classifies diverse datasets, enabling customizable and domain-specific evaluation of large language models to improve transparency and progress in the field.

Contribution

It introduces a scalable, automatically classified benchmark repository supporting continuous updates, addressing the fragmentation and customization challenges in LLM evaluation.

Findings

1. Model performance varies across domains.
2. BenchHub enables flexible, domain-aware benchmarking.
3. It promotes dataset reuse and transparent comparisons.

Abstract

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate…
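To make the idea of customizable, subject-level evaluation concrete, here is a minimal sketch of filtering a unified sample pool by subject and ranking models on the selected subset. It is not BenchHub's actual API: the record fields (`id`, `benchmark`, `subjects`, `answer`), the model names, and all data are hypothetical and exist only for illustration.

```python
# Hypothetical unified records; every field name and value is invented for illustration.
SAMPLES = [
    {"id": "q1", "benchmark": "bench_a", "subjects": ["math"], "answer": "B"},
    {"id": "q2", "benchmark": "bench_a", "subjects": ["math", "physics"], "answer": "C"},
    {"id": "q3", "benchmark": "bench_b", "subjects": ["code"], "answer": "A"},
    {"id": "q4", "benchmark": "bench_b", "subjects": ["code"], "answer": "D"},
]

# Hypothetical model outputs keyed by sample id.
PREDICTIONS = {
    "model_x": {"q1": "B", "q2": "C", "q3": "B", "q4": "D"},
    "model_y": {"q1": "B", "q2": "A", "q3": "A", "q4": "D"},
}


def select(samples, subject):
    """Keep samples tagged with the given subject (samples may carry several labels)."""
    return [s for s in samples if subject in s["subjects"]]


def accuracy(samples, preds):
    """Exact-match accuracy of one model's predictions on the given samples."""
    if not samples:
        return 0.0
    correct = sum(preds.get(s["id"]) == s["answer"] for s in samples)
    return correct / len(samples)


def rank(samples, predictions):
    """Order model names by descending accuracy, breaking ties alphabetically."""
    scores = {name: accuracy(samples, p) for name, p in predictions.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


for subject in ("math", "code"):
    print(subject, rank(select(SAMPLES, subject), PREDICTIONS))
```

In this toy data the two models tie on the full pool but swap places between the `math` and `code` subsets, which is the kind of subject-dependent ranking shift the paper reports.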

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- Bringing together many datasets under a single schema with fine-grained, multi-label subjects is valuable and supports more grounded analysis, given the multitude of benchmarks out there.
- The results showing that rankings change across subjects and sampling criteria are interesting and make the case for a unified suite.
- The merging pipeline is completely automated, which makes it easy to integrate new datasets.

Weaknesses

- The evaluation is constrained to MCQ or short-form formats. This might skew the ranking results, as the original benchmark might use a different metric. The score aggregation needs to be done carefully, and it would be useful to also have the original scores to compare against.
- Even with a taxonomy, mixing samples from different benchmarks can drift from the purpose of any single benchmark.
- It’s not clear how sensitive models are to other sample-level attributes.
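One reason pooled score aggregation needs care can be shown with a small sketch. It compares a micro average (pooling all questions) with a macro average (averaging per-benchmark accuracies) when samples from benchmarks of different sizes are mixed. This is an illustration of the general concern, not the reviewer's or the paper's procedure; the benchmark names and counts are invented.

```python
# Hypothetical (correct, total) counts per source benchmark; all numbers are invented.
RESULTS = {
    "model_x": {"large_bench": (900, 1000), "small_bench": (5, 10)},
    "model_y": {"large_bench": (850, 1000), "small_bench": (9, 10)},
}


def micro_average(per_benchmark):
    """Pool every question together, so large benchmarks dominate the score."""
    correct = sum(c for c, _ in per_benchmark.values())
    total = sum(t for _, t in per_benchmark.values())
    return correct / total


def macro_average(per_benchmark):
    """Average the per-benchmark accuracies, so each benchmark counts equally."""
    accuracies = [c / t for c, t in per_benchmark.values()]
    return sum(accuracies) / len(accuracies)


for name, per_benchmark in RESULTS.items():
    print(name, round(micro_average(per_benchmark), 3), round(macro_average(per_benchmark), 3))
```

With these made-up counts, `model_x` leads under micro averaging while `model_y` leads under macro averaging, so the aggregation choice alone can change the ranking.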

Reviewer 02·Rating 4·Confidence 5

Strengths

1. The data size, number of examples, models, and languages are good.
2. The automation part is a good contribution, both the classifier and the agent.
3. The point that different subsets can lead to different rankings goes through well.
4. The motivation argument in favor of more dynamic evaluation is convincing.

Weaknesses

1. The presented solution relies on introducing a new categorization of examples. However, this merely replaces the benchmark categorization with another (yours), without providing a real domain-adaptive dynamic evaluation setup. My interests may differ significantly from the ones modeled (e.g., physics, simplification in Kordish). This is a core issue I see with the proposed solution. While the system can be extended, doing so requires substantial effort from users. Moreover, the proposed categ

Reviewer 03·Rating 8·Confidence 4

Strengths

- The paper is well-written and presents coherent and cogent arguments.
- I personally appreciate the motivation and problem statement in the paper, and I believe it is quite relevant given the numerous "fragmented" datasets and benchmarks published independently every day. I believe that BenchHub can be the cohesive factor for this group and can help with a more holistic and tailored evaluation of LLMs.
- I also appreciate the authors' attempts to push for a multilingual version of BenchHub.

Weaknesses

- As of now, I don't see any glaring errors in the paper, and I don't see any such weakness from my side. I feel the paper has answered and justified its problem statement appropriately (for me). I am open to interacting with the authors and fellow reviewers during the rebuttal phase.

Taxonomy

Topics: Digital Rights Management and Security · Semantic Web and Ontologies · Library Science and Information Systems