# Development of a rule-based natural language processing algorithm to extract sleep information in pediatric primary care patients with a sleep diagnosis

**Authors:** Joseph W Sirrianni, Ariana Calloway, Syed-Amad Hussain, Hongfang Liu, Christopher W Bartlett, Mattina A Davenport

PMC · DOI: 10.1093/sleepadvances/zpag014 · Sleep Advances: A Journal of the Sleep Research Society · 2026-02-13

## TL;DR

The paper introduces a new low-resource natural language processing algorithm that effectively identifies sleep-related information in pediatric clinical notes.

## Contribution

A novel vocabulary of pediatric sleep-related terms is developed and shown to identify relevant clinical notes with a recall of 0.992 and a precision of 0.852.

## Key findings

- The sleep vocabulary achieved a recall of 0.992 and precision of 0.852 in identifying sleep-related clinical notes.
- 77.1% of annotated sleep-related text spans included at least one keyword from the vocabulary.

## Abstract

The current study employed natural language processing (NLP) to capture multidimensional and transdiagnostic information in pediatric clinical notes. We present a novel, low-resource sleep vocabulary that can be applied to notes to identify pediatric sleep-related mentions automatically.

Using a combination of existing medical sleep ontologies, interviews with clinicians, and examination of clinical note narratives, we developed a novel vocabulary of pediatric sleep-related terms and phrases that covers technical terms, abbreviations, and colloquial keywords used in describing pediatric sleep health. We compared our vocabulary against a set of manually annotated clinical notes to determine its effectiveness for identifying notes with pediatric sleep-related mentions.
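As a rough illustration of how a keyword vocabulary like this can be applied to note text, the following minimal sketch compiles a term list into one case-insensitive regular expression and reports matching character spans. The term list `SLEEP_TERMS` and the helper names `build_pattern` and `find_mentions` are hypothetical; the paper's actual vocabulary and matching rules are not reproduced in this summary.

```python
import re

# Hypothetical mini-vocabulary mixing technical terms, abbreviations, and
# colloquial phrases. This is NOT the paper's term list.
SLEEP_TERMS = [
    "insomnia",
    "obstructive sleep apnea",
    "OSA",
    "snoring",
    "trouble falling asleep",
    "night wakings",
]


def build_pattern(terms):
    """Compile the terms into one case-insensitive, word-bounded pattern."""
    # Longest terms first so multi-word phrases are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_mentions(note, pattern):
    """Return (start, end, matched_text) for every vocabulary hit in a note."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(note)]


PATTERN = build_pattern(SLEEP_TERMS)
note = "Mom reports snoring and trouble falling asleep. Hx of OSA."
hits = find_mentions(note, PATTERN)
for start, end, text in hits:
    print(start, end, text)
```

A matcher of this kind needs only the standard library, which is consistent with the low-resource framing, but it does not handle negation ("no snoring") or context, which is one plausible source of false positives.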

Our vocabulary correctly identified clinical notes with pediatric sleep-related mentions with a recall of 0.992 and a precision of 0.852. Most false positives occurred in notes that either explicitly stated no sleep issues or contained text unrelated to patient sleep health (e.g. medication side effects). Among the text spans annotated as sleep-related mentions, 77.1% included at least one keyword from our vocabulary.
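For reference, note-level precision is TP / (TP + FP) and recall is TP / (TP + FN), where a true positive is a note that both the vocabulary and the annotators mark as sleep-related. The sketch below computes these two metrics and a span-level keyword coverage on toy data. The function names and all values are illustrative assumptions and do not reproduce the study's data or evaluation code.

```python
def precision_recall(predicted, gold):
    """Note-level precision and recall from parallel boolean lists."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)       # flagged and truly sleep-related
    fp = sum(1 for p, g in pairs if p and not g)   # flagged but not sleep-related
    fn = sum(1 for p, g in pairs if not p and g)   # missed sleep-related notes
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def span_coverage(spans, keywords):
    """Fraction of annotated spans containing at least one keyword."""
    lowered = [k.lower() for k in keywords]
    covered = sum(1 for s in spans if any(k in s.lower() for k in lowered))
    return covered / len(spans) if spans else 0.0


# Toy data only: five notes, one false positive, no misses.
predicted = [True, True, True, False, True]
gold = [True, True, False, False, True]
precision, recall = precision_recall(predicted, gold)

# Toy annotated spans and a hypothetical three-keyword vocabulary.
spans = [
    "wakes up several times a night",
    "snoring loudly",
    "bedtime resistance",
    "tired in the morning",
]
coverage = span_coverage(spans, ["snoring", "wakes up", "bedtime"])
print(precision, recall, coverage)
```

Note that `span_coverage` uses plain substring matching for brevity; a stricter evaluation would reuse the same word-bounded matching applied to the notes themselves.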

Our vocabulary showed excellent performance for identifying pediatric sleep-related mentions at the clinical note level and moderate performance for identifying the specific text containing patient mentions. Our low-resource vocabulary, which can be deployed in almost any compute environment, can serve as a first pass over clinical notes, flagging which notes or note sections should be further processed by more advanced models or manual annotation review to extract narrower mentions.
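One way to picture the first-pass role described here is a filter that keeps only the note sections containing a vocabulary hit and hands them to a later stage. The sketch below assumes sections are separated by blank lines and uses a hypothetical three-term vocabulary; `flag_sections` and `triage` are illustrative names, not part of the published method.

```python
import re

# Hypothetical terms; the real vocabulary is larger and is not shown here.
TERMS = ["sleep", "insomnia", "snoring"]
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TERMS) + r")\b",
    re.IGNORECASE,
)


def flag_sections(note):
    """Return (index, text) for each section that contains a vocabulary hit.

    Assumption: sections are separated by one or more blank lines.
    """
    sections = [s.strip() for s in re.split(r"\n\s*\n", note) if s.strip()]
    return [(i, s) for i, s in enumerate(sections) if PATTERN.search(s)]


def triage(notes):
    """Map note IDs to their flagged sections, dropping notes with no hits."""
    flagged = {}
    for note_id, text in notes.items():
        hits = flag_sections(text)
        if hits:
            flagged[note_id] = hits
    return flagged


notes = {
    "n1": "HPI: cough x3 days.\n\nSleep: snoring nightly.",
    "n2": "Well child visit. No concerns.",
}
queue = triage(notes)
print(queue)
```

Only the flagged sections in `queue` would then be passed to a more expensive model or to manual review, which keeps the costly step focused on a small fraction of the text.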

## Full-text entities

- **Diseases:** hypersomnolence (MESH:D006970), PPC (MESH:D003428), enuresis (MESH:D004775), DSE (MESH:C535988), delayed sleep phase syndrome (MESH:D020178), health (OMIM:603663), wheezing (MESH:D012135), obstructive sleep apnea (MESH:D020181), fragmented sleep (MESH:D012892), disturbed (MESH:D014832), sleep difficulties (MESH:D012893), insomnia (MESH:D007319), asthma (MESH:D001249), falls (MESH:C537863)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12920604/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12920604/full.md

---
Source: https://tomesphere.com/paper/PMC12920604