The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

Baishi Li; Ta Yu; Kelvin J.L. Koa; Ke-Wei Huang

arXiv:2605.07409·cs.CL·May 11, 2026

TL;DR

This paper addresses how to validate social measures derived from semantic embeddings, proposing a protocol and supporting methods for establishing their construct validity in social science research.

Contribution

The paper introduces the Construct Validity Protocol (CVP) and Counterfactual Neutralization to improve the validity of social constructs measured via embeddings.

Findings

01. The Construct Validity Protocol (CVP) provides a systematic validation pipeline, from conceptualization to quantitative verification.

02. Counterfactual Neutralization reduces confounding in embeddings.

03. A Validity Suite tests various aspects of construct validity.

Abstract

Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the "Proxy Presumption," or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further…
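
The entanglement of $C$ and $Z$ described in the abstract can be illustrated with a small synthetic sketch. The code below is not the paper's Counterfactual Neutralization or Validity Suite, whose details are not given here. It swaps in a plain linear projection that removes a mean-difference direction, and every name, dimension, and constant in it (`cosine_distance`, `project_out`, the 16-dimensional toy embeddings, the binary "topic" confound) is an assumption made for illustration. It shows a cosine-distance proxy tracking the confound in raw embeddings and much less so after the confound direction is projected out.

```python
import numpy as np

def cosine_distance(X, ref):
    """Cosine distance of each row of X to a reference vector."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    r = ref / np.linalg.norm(ref)
    return 1.0 - Xn @ r

def project_out(X, direction):
    """Remove the component of each row of X along `direction`."""
    d = direction / np.linalg.norm(direction)
    return X - np.outer(X @ d, d)

# Synthetic data (assumed, not from the paper): each embedding mixes a
# construct signal C and a binary confound Z (say, topic) on separate axes.
rng = np.random.default_rng(0)
n, dim = 2000, 16
c = rng.normal(size=n)            # latent construct C
z = rng.integers(0, 2, size=n)    # confound Z
u_c = np.zeros(dim); u_c[0] = 1.0
u_z = np.zeros(dim); u_z[1] = 1.0
base = np.zeros(dim); base[2] = 3.0   # shared component common to all texts
X = (base + np.outer(c, u_c) + 2.0 * np.outer(z, u_z)
     + 0.1 * rng.normal(size=(n, dim)))

# Proxy under test: cosine distance to a reference centroid.
ref = X[z == 0].mean(axis=0)
proxy_raw = cosine_distance(X, ref)

# Stand-in for neutralization: project out the Z mean-difference direction.
z_dir = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
X_clean = project_out(X, z_dir)
ref_clean = project_out(ref[None, :], z_dir)[0]
proxy_clean = cosine_distance(X_clean, ref_clean)

# How strongly does the proxy track the confound before and after?
corr_raw = abs(np.corrcoef(proxy_raw, z)[0, 1])
corr_clean = abs(np.corrcoef(proxy_clean, z)[0, 1])
# Does the cleaned proxy still respond to the construct's magnitude?
corr_construct = np.corrcoef(proxy_clean, np.abs(c))[0, 1]

print(f"|corr(proxy, Z)| raw:   {corr_raw:.3f}")
print(f"|corr(proxy, Z)| clean: {corr_clean:.3f}")
print(f"corr(clean proxy, |C|): {corr_construct:.3f}")
```

In this toy setup the confound is linear and known, which is the easy case. Real confounds such as style or authorship may be nonlinear or unobserved, so a check like this is at most a starting point for the kind of validation the abstract calls for.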
