SubData: Bridging Heterogeneous Datasets to Enable Theory-Driven Evaluation of Political and Demographic Perspectives in LLMs

Pietro Bernardelle; Leon Fröhling; Stefano Civelli; Gianluca Demartini

arXiv:2412.16783 · cs.CL · October 14, 2025

Open Access

TL;DR

This paper introduces SubData, a Python library for standardizing diverse datasets to evaluate how well large language models align with various human perspectives, especially political and demographic viewpoints.

Contribution

It presents a novel framework combining a dataset standardization tool with a theory-driven evaluation approach for assessing LLMs' perspective alignment.

Findings

01 SubData enables flexible dataset mapping for diverse research needs

02 The framework allows testing LLMs' classification of content targeting specific demographics

03 Initial application demonstrates its effectiveness in evaluating perspective alignment

Abstract

As increasingly capable large language models (LLMs) emerge, researchers have begun exploring their potential for subjective tasks. While recent work demonstrates that LLMs can be aligned with diverse human perspectives, evaluating this alignment on downstream tasks (e.g., hate speech detection) remains challenging due to the use of inconsistent datasets across studies. To address this issue, in this resource paper we propose a two-step framework: we (1) introduce SubData, an open-source Python library designed for standardizing heterogeneous datasets to evaluate LLMs' perspective alignment; and (2) present a theory-driven approach leveraging this library to test how differently aligned LLMs (e.g., aligned with different political viewpoints) classify content targeting specific demographics. SubData's flexible mapping and taxonomy enable customization for diverse research needs,…
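The two steps described in the abstract can be illustrated with a minimal sketch in plain Python. This is not SubData's actual API: the names `MAPPING`, `TAXONOMY`, `standardize`, and `flag_rate_by_target`, the dataset names, and the record layout are all hypothetical and chosen only for illustration. The sketch assumes each source dataset labels the targeted group with its own vocabulary, that a mapping translates those labels into shared target names, and that a taxonomy assigns each target to a broader category.

```python
from collections import defaultdict

# Hypothetical mapping: (source dataset, original label) -> unified target name.
MAPPING = {
    ("dataset_a", "women"): "women",
    ("dataset_a", "muslim"): "muslims",
    ("dataset_b", "female"): "women",
    ("dataset_b", "islam"): "muslims",
}

# Hypothetical taxonomy: unified target name -> broader category.
TAXONOMY = {"women": "gender", "muslims": "religion"}


def standardize(records, mapping=MAPPING, taxonomy=TAXONOMY):
    """Step 1: convert heterogeneous records to one schema.

    Records whose (dataset, label) pair is not in the mapping are dropped.
    """
    unified = []
    for record in records:
        target = mapping.get((record["dataset"], record["label"]))
        if target is None:
            continue
        unified.append(
            {
                "text": record["text"],
                "target": target,
                "category": taxonomy[target],
                "source": record["dataset"],
            }
        )
    return unified


def flag_rate_by_target(records, classify):
    """Step 2: share of texts a classifier flags, per unified target group.

    `classify` is any callable returning a truthy value for flagged text,
    for example a wrapper around a differently aligned LLM.
    """
    counts = defaultdict(lambda: [0, 0])  # target -> [flagged, total]
    for record in records:
        tally = counts[record["target"]]
        tally[0] += int(bool(classify(record["text"])))
        tally[1] += 1
    return {target: flagged / total for target, (flagged, total) in counts.items()}
```

Calling `flag_rate_by_target` once per model on the same standardized records gives per-group rates that can be compared across models; how those differences are interpreted against theory is the part the paper itself addresses.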

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Lib · ALIGN