Content Reduction, Surprisal and Information Density Estimation for Long Documents
Shaoxiong Ji, Wei Sun, Pekka Marttinen

TL;DR
This paper investigates how information is distributed over long documents and how content reduction, such as token selection and text summarization, affects their information density. It proposes an attention-based word selection method for clinical notes and studies machine summarization of documents from multiple domains.
Contribution
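The paper presents four criteria for estimating information density in long documents (surprisal, entropy, uniform information density, and lexical density) and proposes an attention-based word selection method for clinical notes, evaluated empirically on automated medical coding.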
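As a rough illustration of what attention-based word selection can look like, the sketch below keeps the highest-weighted tokens of a note while preserving their original order. This page does not describe the paper's actual selection procedure, so everything here is an assumption: the function name `select_top_tokens`, the `keep_ratio` parameter, and the attention weights, which are taken as given (for example, from some trained coding model).

```python
import math

def select_top_tokens(tokens, weights, keep_ratio=0.5):
    """Keep the tokens with the largest attention weights, in document order.

    `weights` is assumed to hold one attention score per token, produced
    by some trained model; how the paper derives them is not shown here.
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    if not tokens:
        return []
    # Number of tokens to keep (hypothetical policy: fixed ratio, at least one).
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the k highest-weighted tokens.
    top = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:k]
    # Restore original order so the reduced text stays readable.
    return [tokens[i] for i in sorted(top)]

tokens = ["patient", "has", "acute", "renal", "failure", "today"]
weights = [0.05, 0.02, 0.30, 0.25, 0.35, 0.03]  # made-up scores for illustration
reduced = select_top_tokens(tokens, weights, keep_ratio=0.5)
```

Selecting by a fixed ratio is only one possible policy; a threshold on the weight itself would be an equally plausible choice.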
Findings
Systematic differences in information density across domains
Effectiveness of attention-based word selection in medical coding
Content reduction influences information density in long documents
Abstract
Many computational linguistic methods have been proposed to study the information content of languages. We consider two interesting research questions: 1) how is information distributed over long documents, and 2) how does content reduction, such as token selection and text summarization, affect the information density in long documents. We present four criteria for information density estimation for long documents, including surprisal, entropy, uniform information density, and lexical density. Among those criteria, the first three adopt the measures from information theory. We propose an attention-based word selection method for clinical notes and study machine summarization for multiple-domain documents. Our findings reveal the systematic difference in information density of long text in various domains. Empirical results on automated medical coding from long clinical notes show the…
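To make the four criteria named in the abstract concrete, here is a minimal sketch that computes them on a token list. It rests on simplifying assumptions that are not taken from the paper: probabilities come from a unigram model estimated on the document itself (rather than a trained language model), uniform information density is approximated by the variance of per-token surprisal, and lexical density uses a small hypothetical function-word list instead of part-of-speech tags.

```python
import math
from collections import Counter

def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities (a stand-in for a language model)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def surprisals(tokens, probs):
    """Per-token surprisal in bits: -log2 p(w)."""
    return [-math.log2(probs[w]) for w in tokens]

def entropy(probs):
    """Shannon entropy of the unigram distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values())

def uid_variance(token_surprisals):
    """Variance of surprisal; lower values mean information is spread more evenly."""
    mean = sum(token_surprisals) / len(token_surprisals)
    return sum((s - mean) ** 2 for s in token_surprisals) / len(token_surprisals)

def lexical_density(tokens, function_words):
    """Share of tokens that are not in the given function-word set."""
    content = [w for w in tokens if w not in function_words]
    return len(content) / len(tokens)

tokens = ["a", "b", "a", "b"]
probs = unigram_probs(tokens)
s = surprisals(tokens, probs)
h = entropy(probs)
v = uid_variance(s)
ld = lexical_density(tokens, function_words={"a"})
```

Running the same functions on a document before and after content reduction gives a simple way to compare how each measure shifts.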
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Natural Language Processing Techniques · Topic Modeling
