Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar

TL;DR
This paper unifies causal representation learning and foundation models to learn human-interpretable concepts from data, demonstrating theoretical recoverability and practical utility through experiments on synthetic data and large language models.
Contribution
It introduces a formal framework connecting causal and foundation model approaches, enabling provable recovery of interpretable concepts from diverse datasets.
Findings
Concepts can be provably recovered from data.
The unified approach improves the interpretability of models.
Experimental results on synthetic data and language models support the method.
Abstract
To build intelligent machine learning systems, there are two broad approaches. One approach is to build inherently interpretable models, as endeavored by the growing field of causal representation learning. The other approach is to build highly-performant foundation models and then invest efforts into understanding how they work. In this work, we relate these two approaches and study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we formally define a notion of concepts and show that they can be provably recovered from diverse data. Experiments on synthetic data and large language models show the utility of our unified approach.
Taxonomy
Topics: Topic Modeling · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)
