Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

Jacob Carlson

arXiv:2511.01680·econ.EM·May 6, 2026

TL;DR

This paper introduces a statistically principled framework for discovering interpretable insights from unstructured data using high-dimensional hypothesis testing, interpretability methods, and natural language descriptions.

Contribution

It presents a novel, flexible approach combining AI interpretability, selective inference, and high-dimensional testing for discovery in unstructured datasets.

Findings

01. Framework enables robust, interpretable discoveries from text, audio, and video data.

02. Provides open-source code for implementation and validation.

03. Applied to economics data for descriptive and causal insights.

Abstract

Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) manually pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing such discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on AI interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "concept embeddings"; computes statistics from these concept embeddings for testing interpretable, concept-by-concept hypotheses;…
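
The concept-by-concept testing step described in the abstract can be illustrated with a small simulation. The abstract is truncated here, so the paper's actual inference procedure is not shown; the sketch below substitutes a Welch-type two-sample statistic with a normal approximation and the Benjamini-Hochberg step-up procedure as generic stand-ins. The data are synthetic, and every function name (`simulate_concept_embeddings`, `welch_z_pvalues`, `benjamini_hochberg`) is hypothetical rather than taken from the paper's code.

```python
import math
import numpy as np

def welch_z_pvalues(x_a, x_b):
    """Two-sided p-values, one per concept (column), from a Welch-type
    statistic with a normal approximation (an assumed stand-in test)."""
    mean_diff = x_a.mean(axis=0) - x_b.mean(axis=0)
    se = np.sqrt(x_a.var(axis=0, ddof=1) / x_a.shape[0]
                 + x_b.var(axis=0, ddof=1) / x_b.shape[0])
    # Concepts that never vary have zero standard error; assign z = 0 (p = 1).
    z = np.divide(mean_diff, se, out=np.zeros_like(mean_diff), where=se > 0)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of rejected hypotheses under the BH step-up procedure."""
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

def simulate_concept_embeddings(n, d, shifted, shift, rng):
    """Synthetic sparse, nonnegative 'concept embeddings' (hypothetical data)."""
    active = rng.random((n, d)) < 0.1           # roughly 10% of concepts fire
    values = rng.exponential(1.0, size=(n, d))  # activation strength when firing
    x = active * values
    x[:, shifted] += shift                      # plant a group difference
    return x

rng = np.random.default_rng(0)
d = 500                                         # number of concepts tested
true_concepts = np.arange(5)                    # concepts that truly differ
group_a = simulate_concept_embeddings(400, d, true_concepts, 1.0, rng)
group_b = simulate_concept_embeddings(400, d, true_concepts, 0.0, rng)

pvals = welch_z_pvalues(group_a, group_b)
discoveries = np.nonzero(benjamini_hochberg(pvals, alpha=0.05))[0]
print("discovered concept indices:", discoveries)
```

In a real application, each column would correspond to a concept with a natural language description, so the indices returned by the multiple-testing step could be read off as interpretable findings. The choice of test statistic and error-rate control here is an assumption for illustration only.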
