TL;DR
This paper introduces a statistically principled framework for discovering interpretable insights from unstructured data using high-dimensional hypothesis testing, interpretability methods, and natural language descriptions.
Contribution
It presents a novel, flexible approach combining AI interpretability, selective inference, and high-dimensional testing for discovery in unstructured datasets.
Findings
The framework enables robust, interpretable discoveries from text, audio, and video data.
The authors provide open-source code for implementation and validation.
The framework is applied to economics data to produce descriptive and causal insights.
Abstract
Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) manually pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing such discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on AI interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "concept embeddings"; computes statistics from these concept embeddings for testing interpretable, concept-by-concept hypotheses;…
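The concept-by-concept testing step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's procedure: the concept embeddings are simulated binary activations, the per-concept statistic is a two-sample z-test on activation rates, and the Benjamini-Hochberg step-up rule stands in for whatever multiple-testing correction the paper actually uses. The function names, group sizes, and activation rates are all hypothetical.

```python
import math
import random


def two_sample_z_pvalue(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    if se == 0.0:
        # Degenerate case: no within-group variation at all.
        return 1.0 if mx == my else 0.0
    z = (mx - my) / se
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes its threshold
    return sorted(order[:cutoff])


# Simulated sparse concept embeddings (assumption: 50 binary concepts,
# 500 documents per group, only concept 3 differs between groups).
rng = random.Random(0)
n_docs, n_concepts, signal = 500, 50, 3


def simulate_group(signal_rate):
    return [
        [1.0 if rng.random() < (signal_rate if j == signal else 0.1) else 0.0
         for j in range(n_concepts)]
        for _ in range(n_docs)
    ]


group_a = simulate_group(0.6)
group_b = simulate_group(0.1)

# One hypothesis per concept: "activation rate is equal across groups".
pvals = [
    two_sample_z_pvalue([d[j] for d in group_a], [d[j] for d in group_b])
    for j in range(n_concepts)
]
discoveries = benjamini_hochberg(pvals, alpha=0.05)
print(discoveries)
```

The point of the sketch is the structure: one interpretable hypothesis per concept, followed by a correction that accounts for testing many concepts at once. In a real application the simulated activations would be replaced by embeddings produced by an interpretability method.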
