CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

Satyam Dwivedi; Sanjukta Ghosh; Shivam Dwivedi; Nishi Kumari; Anil Thakur; Anurag Purushottam; Deepak Alok; Praveen Gatla; Manjuprasad B; Bipasha Patgiri

arXiv:2602.06446·cs.CL·February 9, 2026

TL;DR

This paper introduces CORE, a 225K-question dataset and a 203-question benchmark for evaluating whether LLMs can distinguish meaningful semantic relations from unrelated pairs. The results reveal a large gap between models and humans on unrelated pairs.

Contribution

The paper presents CORE, a comprehensive dataset and benchmark for assessing LLMs' understanding of semantic relations and unrelatedness, highlighting their limitations in domain-specific reasoning.

Findings

01 LLMs perform poorly on unrelated pairs, with accuracy of 0-41.35% across the 29 models evaluated.

02 Human accuracy is 92.6% (95.1% on unrelated pairs), against 48.25-70.9% overall for the LLMs.

03 Models show increased calibration error and semantic collapse on unrelated pairs.

Abstract

Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error…
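The abstract reports accuracy separately for related and unrelated pairs and mentions Expected Calibration Error. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes the standard equal-width-bin definition of ECE; the `Prediction` record, the bin count, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model answer (hypothetical record layout, not from the paper)."""
    is_unrelated: bool  # True if the gold label is "unrelated"
    correct: bool       # True if the model chose the gold option
    confidence: float   # model-reported confidence in [0, 1]


def accuracy(preds):
    """Fraction of correct predictions; NaN for an empty list."""
    if not preds:
        return float("nan")
    return sum(p.correct for p in preds) / len(preds)


def expected_calibration_error(preds, n_bins=10):
    """ECE with equal-width confidence bins (assumed standard definition).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    share of predictions that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for p in preds:
        # Clamp so that confidence == 1.0 lands in the last bin.
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)

    total = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p.confidence for p in bucket) / len(bucket)
        ece += len(bucket) / total * abs(accuracy(bucket) - mean_conf)
    return ece


# Toy data (made up): the model is equally confident on both subsets
# but is right far less often on unrelated pairs.
related = [Prediction(False, True, 0.9) for _ in range(4)]
unrelated = [Prediction(True, i == 0, 0.9) for i in range(4)]

related_acc = accuracy(related)
unrelated_acc = accuracy(unrelated)
related_ece = expected_calibration_error(related)
unrelated_ece = expected_calibration_error(unrelated)

print(f"related:   acc={related_acc:.2f}  ECE={related_ece:.2f}")
print(f"unrelated: acc={unrelated_acc:.2f}  ECE={unrelated_ece:.2f}")
```

In this toy setup the confidence is the same on both subsets, so the calibration error grows only because accuracy falls on the unrelated pairs, which mirrors the pattern the abstract describes.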

Code & Models

Datasets

vaikhari-ai/core
dataset · 12 dl

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Artificial Intelligence in Healthcare and Education