HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI
Tidor-Vlad Pricope

TL;DR
HardML is a new benchmark with 100 challenging questions designed to evaluate AI's knowledge and reasoning in data science and machine learning, revealing current models' limitations.
Contribution
The paper introduces HardML, a novel, carefully crafted benchmark for assessing AI's understanding and reasoning in data science and machine learning.
Findings
State-of-the-art AI models reach an error rate of about 30% on HardML.
HardML is harder than the comparable MMLU ML benchmark: the error rate on HardML is about three times larger.
HardML provides a modern, rigorous test for AI progress in data science and machine learning.
Abstract
We present HardML, a benchmark designed to evaluate knowledge and reasoning abilities in the fields of data science and machine learning. HardML comprises a diverse set of 100 challenging multiple-choice questions, handcrafted over a period of 6 months, covering the most popular and modern branches of data science and machine learning. These questions are challenging even for a typical Senior Machine Learning Engineer to answer correctly. To minimize the risk of data contamination, HardML uses mostly original content devised by the author. Current state-of-the-art AI models achieve a 30% error rate on this benchmark, which is about 3 times larger than the one achieved on the equivalent, well-known MMLU ML. While HardML is limited in scope and not aiming to push the frontier, primarily due to its multiple-choice nature, it serves as a rigorous and modern testbed to quantify and track…
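The error-rate figures in the abstract can be illustrated with a minimal Python sketch. It assumes each question has a single correct option, which the abstract does not specify; the function name `error_rate` and the toy answer lists are hypothetical and are not data from the paper.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Return the fraction of questions answered incorrectly."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("need at least one question")
    wrong = sum(p != c for p, c in zip(predicted, correct))
    return wrong / len(correct)


# Hypothetical toy data: 10 questions, the last 3 answered wrongly.
answer_key = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
model_answers = ["A", "C", "B", "D", "A", "B", "C", "A", "B", "C"]

hardml_like_error = error_rate(model_answers, answer_key)

# The abstract says the HardML error rate is "about 3 times larger" than on
# MMLU ML, so the implied MMLU ML error is roughly one third of it.
# This is a back-of-the-envelope inference, not a figure reported above.
implied_mmlu_ml_error = hardml_like_error / 3
```

On a 100-question benchmark, a 30% error rate corresponds to about 30 wrong answers, and the implied MMLU ML error rate would be around 10%.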
Taxonomy
Topics: Machine Learning and Data Classification · Neural Networks and Applications · Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training
