SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

Jamelle Watson-Daniels; Himaghna Bhattacharjee; Skyler Wang; Brandon Handoko; Antonio Li; Anaelia Ovalle; Mahesh Pasupuleti; Candace Ross; Vidya Sarma; Arjun Subramonian; Karen Ullrich; Will van der Vaart; Yijing Xin; Maximilian Nickel

arXiv:2605.06444·cs.AI·May 8, 2026

TL;DR

SCRuB is an evaluation framework for assessing how large language models reason about social concepts, using expert-validated prompts and a critical thinking rubric. In the reported comparisons, model responses outperform those of human experts.

Contribution

The paper presents SCRuB, a systematic, expert-grounded evaluation method for social concept reasoning in LLMs, including a large dataset and an ensemble validation approach.

Findings

01 Models outperform human experts across all five rubric dimensions.

02 In 1,170 pairwise comparisons, model responses were preferred 74.4% of the time.

03 The single-turn evaluation format has reached performance saturation.
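
As a rough sanity check on the second finding, the sketch below lists the win counts that would be consistent with a 74.4% preference rate over 1,170 comparisons. The function name `consistent_win_counts` is hypothetical, and the sketch assumes the percentage was rounded to one decimal place and that every comparison produced a clear preference; this summary states neither, and it does not report the raw count.

```python
def consistent_win_counts(total: int, reported_pct: float, decimals: int = 1) -> list[int]:
    """Return every win count whose rate rounds to the reported percentage.

    Assumption: the reported figure is wins / total, rounded to `decimals`
    places, with no ties or excluded comparisons.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return [
        wins
        for wins in range(total + 1)
        if round(100 * wins / total, decimals) == reported_pct
    ]


# Figures taken from the findings above; the resulting counts are inferred, not reported.
candidates = consistent_win_counts(total=1170, reported_pct=74.4)
print(candidates)
```

Under these assumptions the model was preferred in roughly 870 of the 1,170 comparisons. The exact count should be taken from the paper itself.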

Abstract

While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel…
