BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh

TL;DR
BenchHub is a comprehensive, dynamic benchmark platform that aggregates and classifies diverse datasets, enabling customizable and domain-specific evaluation of large language models to improve transparency and progress in the field.
Contribution
It introduces a scalable, automatically classified benchmark repository supporting continuous updates, addressing the fragmentation and customization challenges in LLM evaluation.
Findings
Model performance varies across domains.
BenchHub enables flexible, domain-aware benchmarking.
It promotes dataset reuse and transparent comparisons.
Abstract
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate…
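The kind of subject-filtered evaluation the abstract describes can be sketched as follows. This is a minimal illustration, not BenchHub's actual schema or API: the `Sample` fields, the helper functions, the exact-match scoring, and the toy data are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One benchmark question in a hypothetical unified schema (assumed fields)."""
    benchmark: str    # name of the source dataset
    subjects: tuple   # multi-label subject tags
    question: str
    answer: str


def select(samples, subject):
    """Keep only the samples tagged with the requested subject."""
    return [s for s in samples if subject in s.subjects]


def accuracy(samples, predictions):
    """Exact-match accuracy; predictions maps question text to a model answer."""
    if not samples:
        return 0.0
    correct = sum(predictions.get(s.question) == s.answer for s in samples)
    return correct / len(samples)


def rank_models(samples, model_predictions, subject):
    """Rank model names by accuracy on the subject-filtered subset, best first."""
    subset = select(samples, subject)
    scores = {name: accuracy(subset, preds)
              for name, preds in model_predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy pool and predictions, invented purely for illustration.
POOL = [
    Sample("toy_math", ("math",), "2+2?", "4"),
    Sample("toy_math", ("math",), "3*3?", "9"),
    Sample("toy_code", ("code",), "len([1,2])?", "2"),
    Sample("toy_mix", ("math", "code"), "int('7')+1?", "8"),
]
PREDS = {
    "model_a": {"2+2?": "4", "3*3?": "9", "len([1,2])?": "3", "int('7')+1?": "8"},
    "model_b": {"2+2?": "4", "3*3?": "6", "len([1,2])?": "2", "int('7')+1?": "8"},
}

print(rank_models(POOL, PREDS, "math"))  # ['model_a', 'model_b']
print(rank_models(POOL, PREDS, "code"))  # ['model_b', 'model_a']
```

In this toy setup the two models swap places depending on which subject subset is selected, which is the effect the Findings summarize as performance varying across domains.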
Peer Reviews
Decision · Submitted to ICLR 2026
- Bringing many datasets together under a single schema with fine-grained, multi-label subjects is valuable and supports more grounded analysis, given the multitude of benchmarks available.
- The results showing that rankings change across subjects and sampling criteria are interesting and make the case for a unified suite.
- The merging pipeline is completely automated, making it easy to integrate new datasets.
- The evaluation is constrained to MCQ or short-form answers. This might skew the ranking results, since the original benchmark might use a different metric. The score aggregation needs to be done carefully, and it would be useful to also have the original scores to compare against.
- Even with a taxonomy, mixing samples from different benchmarks can drift from the purpose of any single benchmark.
- It is not clear how sensitive models are to other sample-level attributes.
1. The data size, number of examples, models, and languages are good.
2. The automation part is a good contribution, both the classifier and the agent.
3. The point that different subsets can lead to different rankings goes through well.
4. The motivation argument in favor of more dynamic evaluation is convincing.
1. The presented solution relies on introducing a new categorization of examples. However, this merely replaces the benchmark categorization with another (yours), without providing a real domain-adaptive dynamic evaluation setup. My interests may differ significantly from the ones modeled (e.g., physics, simplification in Kordish). This is a core issue I see with the proposed solution. While the system can be extended, doing so requires substantial effort from users. Moreover, the proposed categ
- The paper is well written and presents coherent, cogent arguments.
- I personally appreciate the motivation and problem statement, which I believe are quite relevant given the numerous "fragmented" datasets and benchmarks published independently every day. I believe that BenchHub can be the cohesive factor for this group and can help with more holistic and tailored evaluation of LLMs.
- I also appreciate the authors' attempts to push for a multilingual version of BenchHub.
- As of now, I don't see any glaring errors in the paper, and I don't see any such weakness from my side. I feel the paper has answered and justified its problem statement appropriately (for me). I am open to interacting with the authors and fellow reviewers during the rebuttal phase.
Taxonomy
Topics: Digital Rights Management and Security · Semantic Web and Ontologies · Library Science and Information Systems
