# A framework for streamlined statistical prediction using topic models

**Authors:** Vanessa Glenny, Jonathan Tuke, Nigel Bean, Lewis Mitchell

arXiv: 1904.06941 · 2019-04-16

## TL;DR

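This paper introduces a classical, supervised statistical learning framework for prediction from large text corpora, using topic models as a data reduction step and the resulting topics as predictors. In a Social Sciences example (applied animal behaviour) and a Humanities example (narrative analysis), topic regression models perform comparably to much less efficient models that use individual words as predictors.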

## Contribution

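The paper presents a framework that uses topic models for data reduction and treats the resulting topics as predictors alongside standard statistical tools for predictive modelling, so that an NLP technique such as topic modelling can be incorporated within classical methodologies.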

## Key findings

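- Topic regression models perform comparably to equivalent models that use individual words as predictors.
- The framework is demonstrated in a Social Sciences context (applied animal behaviour) and a Humanities context (narrative analysis).
- Topic models serve as the data reduction step: topics replace individual words as predictors, and the word-based equivalents are described as much less efficient.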

## Abstract

In the Humanities and Social Sciences, there is increasing interest in approaches to information extraction, prediction, intelligent linkage, and dimension reduction applicable to large text corpora. With approaches in these fields being grounded in traditional statistical techniques, the need arises for frameworks whereby advanced NLP techniques such as topic modelling may be incorporated within classical methodologies. This paper provides a classical, supervised, statistical learning framework for prediction from text, using topic models as a data reduction method and the topics themselves as predictors, alongside typical statistical tools for predictive modelling. We apply this framework in a Social Sciences context (applied animal behaviour) as well as a Humanities context (narrative analysis) as examples of this framework. The results show that topic regression models perform comparably to their much less efficient equivalents that use individual words as predictors.
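The pipeline described in the abstract (reduce the text to topics, then use the topics as predictors in a standard statistical model) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, assumes LDA as the topic model and logistic regression as the predictive model (this summary specifies neither), and uses a hypothetical toy corpus with made-up labels.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and labels (not data from the paper).
docs = [
    "the cat purrs while the kitten naps on the sofa",
    "a kitten chases the cat around the sofa",
    "the cat and the kitten share a bowl of milk",
    "my cat purrs when the kitten naps",
    "the dog barks while the puppy digs in the garden",
    "a puppy chases the dog around the garden",
    "the dog and the puppy share a bone",
    "my dog barks when the puppy digs",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = cat text, 0 = dog text

# Step 1: bag-of-words counts, one column per vocabulary word.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Step 2: data reduction. LDA (an assumed choice of topic model) maps each
# document to a vector of topic proportions.
n_topics = 2  # assumed value; in practice chosen by model selection
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
theta = lda.fit_transform(counts)

# Step 3: topic regression, with the topic proportions as predictors.
topic_reg = LogisticRegression().fit(theta, labels)

# Word-based equivalent for comparison: one predictor per vocabulary word.
word_reg = LogisticRegression(max_iter=1000).fit(counts, labels)

print("word predictors: ", counts.shape[1])
print("topic predictors:", theta.shape[1])

# Predicting for a new document reuses the fitted vectorizer and topic model.
new_theta = lda.transform(vectorizer.transform(["the kitten naps near the cat"]))
print("predicted class probabilities:", topic_reg.predict_proba(new_theta))
```

The point of the sketch is the difference in the number of predictors: the word-based model has one coefficient per vocabulary word, while the topic regression has one per topic. Because topic proportions sum to one within each document, they are collinear with an intercept; the default L2 penalty in scikit-learn's `LogisticRegression` tolerates this, whereas an unpenalised model would usually drop one topic.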

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.06941/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1904.06941/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/1904.06941/full.md

---
Source: https://tomesphere.com/paper/1904.06941