Multi-environment Topic Models
Dominic Sobhani, Amir Feder, David Blei

TL;DR
The paper introduces the Multi-environment Topic Model (MTM), an unsupervised approach that disentangles global and environment-specific topics, improving interpretability and prediction across diverse text datasets.
Contribution
The paper presents a novel probabilistic model that separates global and environment-specific topics, enhancing interpretability and causal inference in multi-environment text data.
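As a rough illustration of what separating global and environment-specific topics can look like, the sketch below builds a per-document word distribution from a shared topic-word matrix plus an environment-specific offset. It assumes the two terms are added in logit space and normalized with a softmax; that additive form, the array shapes, and the names `word_distribution`, `theta`, `beta`, and `gamma` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def word_distribution(theta, beta, gamma, env):
    """Word probabilities for one document (hypothetical helper).

    theta: (K,)      per-document topic intensities
    beta:  (K, V)    global topic-word weights shared by all environments
    gamma: (E, K, V) environment-specific topic-word offsets
    env:   int       index of the document's environment
    """
    # Assumption: global and environment-specific terms combine additively.
    logits = theta @ (beta + gamma[env])
    return softmax(logits)


rng = np.random.default_rng(0)
K, V, E = 3, 8, 2                      # topics, vocabulary size, environments
beta = rng.normal(size=(K, V))
gamma = np.zeros((E, K, V))            # sparse offsets: mostly zero
gamma[1, 0, 5] = 2.0                   # environment 1 boosts word 5 in topic 0
theta = np.array([1.0, 0.2, 0.1])

p_env0 = word_distribution(theta, beta, gamma, env=0)
p_env1 = word_distribution(theta, beta, gamma, env=1)
```

With all offsets for environment 0 set to zero, `p_env0` reduces to the purely global distribution, while `p_env1` shifts probability mass toward the single word that environment 1 emphasizes.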
Findings
MTM produces interpretable global and environment-specific topics.
MTM outperforms baselines both in-distribution and out-of-distribution.
MTM enables accurate causal effect estimation.
Abstract
Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments that modulate a "global" (environment-agnostic) topic representation. Accurately learning these representations is important for prediction on new documents in unseen environments and for estimating the causal effect of topics on real-world outcomes. To this end, we introduce the Multi-environment Topic Model (MTM), an unsupervised probabilistic model that separates global and environment-specific terms. Through experimentation on various political content, from ads to tweets and speeches, we show that the MTM produces interpretable global topics with distinct environment-specific words. On multi-environment data, the MTM outperforms strong baselines in…
Peer Reviews
Decision · Submitted to ICLR 2025
* The introduction of environment-specific topic-word distributions is interesting, which can be seen as one strength of this paper in terms of modelling. The final categorical distribution used to generate each word combines the shared per-topic word distribution, denoted as $\beta_k$, with the environment-specific per-topic word distribution represented by $\gamma_{e, z}$.
* The application of the proposed model in causal estimation adds another interesting dimension to this work. Based on th
* Modeling different environments or contexts through environment-specific topic-by-word matrices resembles the approach of using hierarchical structures to model various types of document collections. For instance, the paper on "Differential Topic Models" employs a hierarchical Pitman-Yor Process to model and compare different document collections, while "ContraVis: Contrastive and Visual Topic Modeling for Comparing Document Collections" also enables the comparison of distinct document collect
MTM is a novel addition to the family of topic models, offering a generative model that introduces a global topic-environment-word distribution, $\gamma$, to enhance adaptability across multiple environments. By using an ARD prior to enforce sparsity, MTM is carefully designed to align with the paper’s goal of interpretable, environment-specific topic modeling. The paper presents two variations, MTM and nMTM, demonstrating flexibility in approach and an attention to design that supports robust g
- Some minor typos:
  - line 266 "by by"
  - line 443 "r\ $\hat{\theta_i}$ \ ${\hat\theta}_i$"
- It would be beneficial to include a more thorough analysis of sparsity enforcement and the selection of the ARD prior versus other priors. Providing ablation studies or theoretical comparisons with alternatives (e.g., the Horseshoe prior) would strengthen the rationale behind this choice and clarify its role in enhancing MTM’s performance, as suggested by the evaluation metrics.
- Expanding the
- The paper is generally well-written.
- The motivation is clear.
- Adjusting global topic distributions for environment-specific variation sounds interesting and feasible.
- In the experiments, the perplexity results on held-out data across models show the superior performance of the proposed method.
- The number of environments in the data used in the experiments (just two or three) is rather small compared to the number of topics.
- The authors evaluate causal effects in the multi-environment setting only on semi-synthetic data. The semi-synthetic data is conveniently designed for the proposed method to work as expected and better than other models that are not multi-environment aware.
- The proposed MTM performs slightly worse than some baseline models in terms of NPMI.
Taxonomy
Topics: Computational and Text Analysis Methods · Big Data Technologies and Applications · Human Mobility and Location-Based Analysis
