TL;DR
This paper introduces Discourse Atom Topic Modeling, a method that combines topic modeling and word embeddings to identify and interpret latent themes in text. The method is demonstrated on violent death narratives from the U.S. National Violent Death Reporting System (NVDRS), where it surfaces gender biases.
Contribution
The paper presents Discourse Atom Topic Modeling, a new approach that integrates embeddings and topic modeling to uncover interpretable latent topics in unstructured text.
Findings
Identified 225 latent topics in violent death narratives.
Revealed gender biases in topics related to violence.
Provided detailed analysis of reporting patterns and gendered language.
Abstract
There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured…
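The two directions described in the abstract (an atom mapping onto a distribution over words, and a word sequence mapping back to an atom) can be illustrated with a small sketch. Everything here is an assumption for illustration: the 2-d vectors are made up, the atom names are hypothetical, the softmax over dot products is one common form of such a generative model, and the mean-vector assignment is a simplified stand-in for whatever inference procedure the paper actually uses. In practice the atoms would be learned by sparse coding over a trained embedding space rather than written by hand.

```python
import math

# Toy 2-d word embeddings; the values are invented for illustration only.
word_vectors = {
    "gun": [0.9, 0.1],
    "shot": [0.8, 0.2],
    "pills": [0.1, 0.9],
    "overdose": [0.2, 0.8],
}

# Two hypothetical atom vectors (assumed here; normally learned by sparse coding).
atoms = {"firearms": [1.0, 0.0], "medication": [0.0, 1.0]}


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def word_distribution(atom):
    """Assumed generative form: p(word | atom) proportional to exp(<atom, word vector>)."""
    scores = {w: math.exp(dot(atom, v)) for w, v in word_vectors.items()}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


def infer_atom(words):
    """Assign a window of words to the atom closest (by cosine) to its mean vector."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        raise ValueError("no known words in window")
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return max(atoms, key=lambda name: cosine(atoms[name], mean))


if __name__ == "__main__":
    print(word_distribution(atoms["firearms"]))
    print(infer_atom(["pills", "overdose"]))
```

Sliding `infer_atom` over successive windows of a document would yield a sequence of atom labels, which is one way to read the abstract's phrase "represent documents as topic sequences".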
