Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence
Norbert Tihanyi, Tamas Bisztray, Richard A. Dubniczky, Rebeka Toth, Bertalan Borsos, Bilel Cherif, Mohamed Amine Ferrag, Lajos Muzsai, Ridhi Jain, Ryan Marinelli, Lucas C. Cordeiro, Merouane Debbah, Vasileios Mavroeidis, Audun Josang

TL;DR
This paper introduces Dynamic Intelligence Assessment (DIA), a new benchmarking framework with dynamic questions and metrics to evaluate AI models' problem-solving, confidence, and reliability across multiple disciplines, revealing significant gaps in current models.
Contribution
The paper presents DIA, a novel dynamic benchmarking methodology and dataset that challenge AI models with mutable questions and assess their confidence and reliability, advancing beyond static benchmarks.
Findings
Models frequently answer even simple questions incorrectly when the same question is posed in varying forms.
GPT-4o tends to overestimate its mathematical abilities.
OpenAI's o1-mini shows the best self-assessment judgment.
Abstract
As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often simplistic, allowing models to perform uniformly well and making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs that the models might memorize or guess. To address these limitations, we introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats, including text, PDFs, compiled binaries, visual puzzles, and CTF-style cybersecurity…
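The idea of a challenge template with mutable parameters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from DIA-Bench: the `QuestionTemplate` class, the modular-exponentiation task, the parameter ranges, and the all-instances-correct check are hypothetical stand-ins, not the paper's templates or metrics.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QuestionTemplate:
    """Hypothetical template: question text with placeholders plus a ground-truth solver."""
    text: str
    sample_params: Callable[[random.Random], dict]
    solve: Callable[[dict], int]

    def instantiate(self, rng: random.Random) -> Tuple[str, int]:
        # Draw fresh parameters, then render the question and compute its answer.
        params = self.sample_params(rng)
        return self.text.format(**params), self.solve(params)


def generate_instances(template: QuestionTemplate, k: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Produce k (question, expected_answer) pairs from one template, reproducibly."""
    rng = random.Random(seed)
    return [template.instantiate(rng) for _ in range(k)]


def solved_consistently(answers: List[int], instances: List[Tuple[str, int]]) -> bool:
    """True only if every generated instance was answered correctly (illustrative check)."""
    if len(answers) != len(instances):
        return False
    return all(a == expected for a, (_, expected) in zip(answers, instances))


# Hypothetical example task: modular exponentiation with randomized operands.
modexp = QuestionTemplate(
    text="Compute {a}^{b} mod {m}.",
    sample_params=lambda rng: {
        "a": rng.randint(2, 99),
        "b": rng.randint(2, 99),
        "m": rng.randint(2, 997),
    },
    solve=lambda p: pow(p["a"], p["b"], p["m"]),
)

instances = generate_instances(modexp, k=5, seed=42)
for question, _ in instances:
    print(question)
```

Because each instance carries different parameters, a memorized answer to one instance does not transfer to the next, and a model that is right on one variant but wrong on another fails the consistency check.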
Taxonomy
Topics: Competitive and Knowledge Intelligence · AI-based Problem Solving and Planning · Cognitive Science and Mapping
