Modeling Text Complexity using a Multi-Scale Probit

Johan Falkenjack, Mattias Villani, and Arne Jönsson

arXiv:1811.04653 · stat.AP · November 13, 2018 · 1 citation

TL;DR

This paper introduces a multi-scale probit model for analyzing text complexity across different annotation schemes, effectively combining diverse corpora to improve predictive accuracy in readability assessment.

Contribution

The paper presents a novel multi-scale probit model with a Gibbs sampler for text complexity analysis, enabling integration of heterogeneous annotation data.

Findings

01. Effective combination of multiple corpora with different annotation schemes.

02. Promising predictive performance on simulated and real readability data.

03. Addresses the $p > n$ problem in text complexity modeling.

Abstract

We present a novel model for text complexity analysis that can be fitted to ordered categorical data measured on multiple scales, e.g., a corpus with binary responses mixed with a corpus with more than two ordered outcomes. The multiple scales are assumed to be driven by the same underlying latent variable describing the complexity of the text. We propose an easily implemented Gibbs sampler that samples from the posterior distribution by directly extending established data augmentation schemes. Combining multiple corpora with different annotation schemes lets us get around the common problem of having more text features than annotated documents, i.e., an instance of the $p > n$ problem. The predictive performance of the model is evaluated using both simulated and real-world readability data, with very promising results.
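The shared-latent-variable assumption in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a linear predictor with standard normal noise (the usual probit construction) and corpus-specific cutpoints. The function name `simulate_corpus`, the feature dimension, the coefficient values, and the cutpoints are all hypothetical choices made for illustration. It covers only the data-generating side of the model, not the Gibbs sampler.

```python
import numpy as np


def simulate_corpus(X, beta, cutpoints, rng):
    """Draw ordered labels by thresholding a shared latent complexity score.

    Assumed construction (not taken from the paper's text):
    latent = X @ beta + standard normal noise, then binned by `cutpoints`.
    """
    # Latent complexity: linear predictor plus N(0, 1) noise (probit link).
    latent = X @ beta + rng.standard_normal(X.shape[0])
    # np.digitize returns the index of the interval each latent value falls in,
    # so labels range over 0..len(cutpoints).
    return np.digitize(latent, cutpoints)


rng = np.random.default_rng(0)
p = 5                                # number of text features (hypothetical)
beta = rng.normal(size=p)            # coefficients shared by both corpora

X_binary = rng.normal(size=(200, p))   # corpus annotated on a binary scale
X_ordinal = rng.normal(size=(300, p))  # corpus annotated on a 4-level scale

# Each corpus has its own cutpoints, but both use the same beta.
y_binary = simulate_corpus(X_binary, beta, np.array([0.0]), rng)
y_ordinal = simulate_corpus(X_ordinal, beta, np.array([-1.0, 0.0, 1.0]), rng)
```

Because both label vectors are generated from the same `beta`, pooling the two corpora gives 500 observations that inform the shared coefficients, which is the intuition behind combining annotation schemes when features outnumber documents in any single corpus.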

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling