Topic Modeling for Free-Response Text Data from a Complex Survey
Namitha V. Pais, Scott H. Holan, Paul A. Parker

TL;DR
This paper adapts topic modeling to complex survey data by incorporating survey weights into the Mixture of Unigrams (MoU) model, so that themes extracted from open-ended survey responses are not biased by an informative sample design.
Contribution
It introduces a pseudolikelihood approach for the Mixture of Unigrams model under informative sampling and develops a hierarchical version accounting for respondent-level factors.
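As a rough illustration of what a survey-weighted pseudolikelihood for a Mixture of Unigrams model can look like, the expression below raises each sampled document's mixture likelihood to the power of its survey weight. The notation is assumed here for illustration and is not taken from the paper: \mathcal{S} is the sample, n_{dv} is the count of term v in document d, \pi_k is the proportion of topic k, \phi_{kv} is the probability of term v under topic k, and \tilde{w}_d is the survey weight of document d, under the common convention that weights are rescaled to sum to the sample size.

```latex
\hat{L}(\boldsymbol{\pi}, \boldsymbol{\phi})
  = \prod_{d \in \mathcal{S}}
    \left[ \sum_{k=1}^{K} \pi_k \prod_{v=1}^{V} \phi_{kv}^{\,n_{dv}} \right]^{\tilde{w}_d},
\qquad
\sum_{d \in \mathcal{S}} \tilde{w}_d = |\mathcal{S}|.
```

With all weights equal to one, this reduces to the ordinary Mixture of Unigrams likelihood (up to the multinomial coefficient, which does not depend on the parameters).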
Findings
The weighted MoU effectively extracts meaningful topics from survey data.
The hierarchical MoU captures variations in topic proportions across respondent characteristics.
Application to American National Election Studies (ANES) data demonstrates improved interpretability of survey responses.
Abstract
Topic Modeling is a popular statistical tool commonly used on textual data to identify the hidden thematic structure in a document collection based on the distribution of words. Additionally, it can be used to cluster the documents, with clusters representing distinct topics. The Mixture of Unigrams (MoU) is a standard topic model for clustering document-term data and can be particularly useful for analyzing open-ended survey responses to extract meaningful information from the underlying topics. However, with complex survey designs, where data is often collected on individual (document) characteristics, it is essential to account for the sample design in order to avoid biased estimates. To address this issue, we propose the MoU model under informative sampling using a pseudolikelihood to account for the sample design in the model by incorporating survey weights. We evaluate the…
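The idea of weighting each document's contribution by its survey weight can be sketched in a few lines. The example below is a minimal sketch only: it fits a weighted Mixture of Unigrams by expectation-maximization, which is a stand-in estimation technique chosen for brevity and not necessarily the estimation approach used in the paper. The function name `weighted_mou_em`, its parameters, the additive smoothing constant, and the rescaling of weights to sum to the sample size are all assumptions made for illustration.

```python
import numpy as np


def weighted_mou_em(counts, weights, n_topics, n_iter=100, smoothing=1e-2, seed=0):
    """Hypothetical helper: survey-weighted EM for a Mixture of Unigrams.

    counts  : (D, V) document-term count matrix
    weights : (D,) survey weights, one per document (respondent)
    Returns topic proportions pi (K,), topic-term probabilities phi (K, V),
    and per-document topic responsibilities resp (D, K).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, n_terms = counts.shape

    # Assumption: rescale weights so they sum to the sample size.
    w = np.asarray(weights, dtype=float)
    w = w * n_docs / w.sum()

    pi = np.full(n_topics, 1.0 / n_topics)
    phi = rng.dirichlet(np.ones(n_terms), size=n_topics)  # shape (K, V)

    for _ in range(n_iter):
        # E-step: posterior topic membership of each document (log space).
        log_post = np.log(pi) + counts @ np.log(phi).T  # shape (D, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: each document contributes in proportion to its weight.
        wr = resp * w[:, None]
        pi = (wr.sum(axis=0) + smoothing) / (wr.sum() + n_topics * smoothing)
        phi = wr.T @ counts + smoothing
        phi /= phi.sum(axis=1, keepdims=True)

    return pi, phi, resp


if __name__ == "__main__":
    # Toy data: four short "responses" over a four-term vocabulary.
    toy_counts = np.array([[4, 1, 0, 0], [5, 0, 1, 0], [0, 0, 4, 2], [0, 1, 3, 3]])
    toy_weights = np.array([1.0, 1.0, 3.0, 3.0])
    pi, phi, resp = weighted_mou_em(toy_counts, toy_weights, n_topics=2)
    print("topic proportions:", pi)
    print("hard cluster labels:", resp.argmax(axis=1))
```

Setting all weights to the same value recovers an unweighted fit, which gives a simple way to check how much the sample design changes the estimated topic proportions.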
Taxonomy
Topics: Computational and Text Analysis Methods
