Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence

Norbert Tihanyi; Tamas Bisztray; Richard A. Dubniczky; Rebeka Toth; Bertalan Borsos; Bilel Cherif; Mohamed Amine Ferrag; Lajos Muzsai; Ridhi Jain; Ryan Marinelli; Lucas C. Cordeiro; Merouane Debbah; Vasileios Mavroeidis; Audun Josang

arXiv:2410.15490·cs.AI·November 26, 2024

Open Access · 1 dataset

TL;DR

This paper introduces Dynamic Intelligence Assessment (DIA), a new benchmarking framework with dynamic questions and metrics to evaluate AI models' problem-solving, confidence, and reliability across multiple disciplines, revealing significant gaps in current models.

Contribution

The paper presents DIA, a novel dynamic benchmarking methodology and dataset that challenge AI models with mutable questions and assess their confidence and reliability, advancing beyond static benchmarks.

Findings

01

Models often answer even simple questions incorrectly when the same question is posed in varied forms.

02

GPT-4o tends to overestimate its mathematical abilities.

03

OpenAI's o1-mini shows the best self-assessment judgment.

Abstract

As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often simplistic, allowing models to perform uniformly well and making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs that the models might memorize or guess. To address these limitations, we introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats, including text, PDFs, compiled binaries, visual puzzles, and CTF-style cybersecurity…
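The idea of a dynamic question template can be illustrated with a minimal Python sketch. This is not the authors' implementation or the DIA-Bench format: the template `modexp_template`, the `Instance` type, the helper `solved_all_fraction`, and the choice of k = 5 instances per template are all hypothetical, chosen only to show how mutable parameters defeat memorization of a fixed question-answer pair.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Instance:
    question: str
    answer: int


def modexp_template(rng: random.Random) -> Instance:
    """Hypothetical template: same task, fresh parameters on every call."""
    base = rng.randint(2, 99)
    exponent = rng.randint(10, 99)
    modulus = rng.randint(101, 999)
    question = f"Compute {base}^{exponent} mod {modulus}."
    return Instance(question, pow(base, exponent, modulus))


def solved_all_fraction(
    templates: Sequence[Callable[[random.Random], Instance]],
    solver: Callable[[str], int],
    k: int = 5,
    seed: int = 0,
) -> float:
    """Fraction of templates whose k generated instances are ALL answered correctly."""
    rng = random.Random(seed)
    solved = 0
    for template in templates:
        instances = [template(rng) for _ in range(k)]
        if all(solver(inst.question) == inst.answer for inst in instances):
            solved += 1
    return solved / len(templates)


def exact_solver(question: str) -> int:
    """Stand-in for a model that actually computes the answer."""
    match = re.fullmatch(r"Compute (\d+)\^(\d+) mod (\d+)\.", question)
    base, exponent, modulus = (int(g) for g in match.groups())
    return pow(base, exponent, modulus)


def memorizing_solver(question: str) -> int:
    """Stand-in for a model that returns one fixed, memorized answer."""
    return -1  # can never match, since every true answer is non-negative


if __name__ == "__main__":
    print(solved_all_fraction([modexp_template], exact_solver))
    print(solved_all_fraction([modexp_template], memorizing_solver))
```

Requiring every instance of a template to be correct is one simple way to reward consistency rather than a single lucky answer; the paper's actual metrics may be defined differently.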

Code & Models

Datasets

dia-bench/DIA-Bench
dataset · 424 dl

Taxonomy

Topics: Competitive and Knowledge Intelligence · AI-based Problem Solving and Planning · Cognitive Science and Mapping