Large Language Models for Statistical Inference: Context Augmentation with Applications to the Two-Sample Problem and Regression
Marc Ratkovic

TL;DR
This paper introduces context augmentation, which uses large language models (LLMs) to generate contexts around observed strings in support of valid frequentist inference. The approach is demonstrated on two-sample testing and text-on-text regression, with theoretical guarantees and real-world applications.
Contribution
It presents a novel method using LLM-generated contexts for valid frequentist inference, connecting LLMs with traditional statistical techniques.
Findings
Method's t-statistics show the expected null behaviour on synthetic data while maintaining power
Approach is powerful and interpretable in a replication study
Theoretical results include identification conditions and efficiency bounds
Abstract
We introduce context augmentation, a data-augmentation approach that uses large language models (LLMs) to generate contexts around observed strings as a means of facilitating valid frequentist inference. These generated contexts serve to reintroduce uncertainty, incorporate auxiliary information, and facilitate interpretability. For example, in the two-sample test, we compare the log-probability of each string under contexts from its own group versus the other group. We show on synthetic data that the method's t-statistics exhibit the expected null behaviour while maintaining power and, through a replication, that the method is powerful and interpretable. We next introduce text-on-text regression. Contexts generated around the predictor string are treated as mediating variables between the predictor and outcome strings. Using negative controls, we then distinguish between semantic and syntactic…
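The two-sample comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that the log-probability of each string under own-group contexts and under other-group contexts has already been computed by some LLM scoring step (not shown), and it assumes the comparison is a paired t-statistic on the per-string differences. The function name `paired_t_statistic` and the example numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(own_logprobs, other_logprobs):
    """t-statistic for the mean of per-string differences (own - other).

    Assumption: both lists hold precomputed log-probabilities for the same
    strings, in the same order. How the paper aggregates over generated
    contexts is not specified here, so one number per string is assumed.
    """
    if len(own_logprobs) != len(other_logprobs):
        raise ValueError("inputs must have the same length")
    diffs = [own - other for own, other in zip(own_logprobs, other_logprobs)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two strings")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n))


# Hypothetical log-probabilities for four strings from one group.
own = [-10.0, -12.0, -9.0, -11.0]
other = [-11.0, -12.5, -10.5, -11.0]
t_stat = paired_t_statistic(own, other)
print(f"t = {t_stat:.3f}")
```

Under this reading, a t-statistic near zero is what one would expect when the two groups are indistinguishable, and a large positive value suggests that strings are more probable under contexts from their own group.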
Taxonomy
Topics: Topic Modeling · Mental Health via Writing · Computational and Text Analysis Methods
