TL;DR
This paper quantifies redundancy in clinical notes using an information-theoretic approach and a lexicosyntactic and semantic model, finding substantial duplication in clinical text and lower efficiency in language models trained on it than in open-domain models.
Contribution
It introduces two novel strategies for measuring redundancy in clinical text and evaluates them using Transformer-based language models trained on large-scale clinical datasets.
Findings
Language models trained on clinical text are 1.5 to 3 times less efficient, in information-theoretic terms, than open-domain language models.
Manual evaluation shows high correlation between redundancy measures and actual text duplication.
Redundancy measures can help improve clinical documentation and NLP applications.
Abstract
The current mode of use of Electronic Health Records (EHRs) elicits text redundancy. Clinicians often populate new documents by duplicating existing notes, then updating accordingly. Data duplication can lead to a propagation of errors, inconsistencies and misreporting of care. Therefore, quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives. This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. We evaluate the measures by training large Transformer-based language models using clinical text from a large openly available US-based ICU dataset and a large multi-site UK-based Trust. By comparing the information-theoretic content of the trained models with…
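To give a rough feel for what an information-theoretic redundancy measure captures, the sketch below uses a general-purpose compressor as a crude stand-in for a trained language model. This is not the method evaluated in the paper, which trains Transformer-based language models. The function name `compression_redundancy` and the example note are hypothetical and exist only for illustration. The idea is that text carrying little new information per byte, such as a note copied forward several times, compresses to a much smaller fraction of its raw size.

```python
import zlib


def compression_redundancy(text: str) -> float:
    """Return 1 - (compressed size / raw size) for UTF-8 encoded text.

    Hypothetical helper: a compression-based proxy, not the paper's measure.
    Values near 1 suggest highly repetitive text; very short inputs can
    yield negative values because of the compressor's fixed overhead.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, 9)  # 9 = strongest compression level
    return 1.0 - len(compressed) / len(raw)


# Invented example note (not taken from any dataset).
note = (
    "Patient admitted with chest pain. ECG shows sinus rhythm. "
    "Troponin negative. Plan: observe overnight."
)
# Simulate copy-forward documentation: the same note pasted five times.
copied_forward = " ".join([note] * 5)

print(compression_redundancy(note))
print(compression_redundancy(copied_forward))
```

The copied-forward text scores higher than the single note because the repeated passages add bytes but almost no new information. A trained language model plays a similar role in a more refined way, by assigning low surprisal to predictable text.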
