# Combining Lexical and Syntactic Features for Detecting Content-dense Texts in News

**Authors:** Yinfei Yang, Ani Nenkova

arXiv: 1704.00440 · 2017-04-04

## TL;DR

This paper measures how common content-dense texts are in news articles from four domains (business, U.S. international relations, sports, and science journalism) and develops a supervised classifier that combines lexical and unlexicalized syntactic features to detect content density.

## Contribution

It introduces domain-specific and domain-independent supervised models for detecting content-dense news texts, and shows that the notion of content density varies across domains and that naive annotators are biased toward the stereotypical domain label.

## Key findings

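- About half of the news texts in the study are not content-dense.
- Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer, while domain-independent classifiers better reproduce naive crowdsourced judgements.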
- Classification accuracy is around 80% across all conditions.

## Abstract

Content-dense news report important factual information about an event in direct, succinct manner. Information seeking applications such as information extraction, question answering and summarization normally assume all text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense and motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classifying model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data only from a given news domain and a general classifier in which data from all four domains is pooled together. Our annotation and prediction experiments demonstrate that the concept of content density varies depending on the domain and that naive annotators provide judgement biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer. Domain independent classifiers reproduce better naive crowdsourced judgements. Classification prediction is high across all conditions, around 80%.
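To make the idea of a two-layer model over two feature families concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not give the paper's feature definitions, training data, or exact architecture. The sketch assumes "two-layer" means stacking, where one first-layer logistic regression is trained on lexical features, another on syntactic features, and a second-layer logistic regression combines their two probabilities. The feature vectors, toy data, and all function names (`train_logreg`, `train_two_layer`, `predict_dense`) are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_two_layer(lex_X, syn_X, y):
    # Layer 1: one classifier per feature family.
    lex_model = train_logreg(lex_X, y)
    syn_model = train_logreg(syn_X, y)
    # Layer 2: combine the two first-layer probabilities.
    # Simplification: a real setup would use held-out (cross-validated)
    # first-layer predictions here to avoid overfitting the second layer.
    meta_X = [
        [predict_proba(lex_model, l), predict_proba(syn_model, s)]
        for l, s in zip(lex_X, syn_X)
    ]
    meta_model = train_logreg(meta_X, y)
    return lex_model, syn_model, meta_model


def predict_dense(models, lex_x, syn_x):
    lex_model, syn_model, meta_model = models
    meta_x = [predict_proba(lex_model, lex_x), predict_proba(syn_model, syn_x)]
    return 1 if predict_proba(meta_model, meta_x) >= 0.5 else 0


# Hypothetical toy data; label 1 = content-dense, 0 = not content-dense.
# Lexical features: two made-up binary word-presence indicators.
lex_X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
# Syntactic features: two made-up ratios of unlexicalized syntactic patterns.
syn_X = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.1],
         [0.2, 0.6], [0.1, 0.7], [0.3, 0.5]]
y = [1, 1, 1, 0, 0, 0]

models = train_two_layer(lex_X, syn_X, y)
train_preds = [predict_dense(models, l, s) for l, s in zip(lex_X, syn_X)]
```

Under the same assumptions, a domain-specific classifier would call `train_two_layer` on texts from a single news domain, and a general classifier would call it on texts pooled from all four domains.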

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.00440/full.md

## Figures

28 figures with captions in the complete paper: https://tomesphere.com/paper/1704.00440/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/1704.00440/full.md

---
Source: https://tomesphere.com/paper/1704.00440