On the Measure of a Model: From Intelligence to Generality
Ruchira Dhar, Ninell Oldenburg, Anders Soegaard

TL;DR
This paper argues that evaluating AI models should focus on their generality, a measurable and stable trait, rather than abstract notions of intelligence, to better reflect their real-world utility across diverse tasks.
Contribution
The paper provides a conceptual and formal analysis showing that generality, not intelligence, is the key stable metric for evaluating AI models across tasks.
Findings
Generality is more stable and better supported empirically than intelligence.
Evaluation should be based on measurable performance breadth.
Generality aligns with multitask learning principles.
Abstract
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
