# Harmonizing self-reported and free text medication data: a reproducible pipeline for gerontological research

**Authors:** Ramkrishna K. Singh, Chen Chen, Semere Bekena, David C. Brown, Kaylin Taylor, Matthew Blake, Yiqi Zhu, Kebede Beyene, David B. Carr, Ganesh M. Babulal

PMC · DOI: 10.1186/s12911-025-03332-w · BMC Medical Informatics and Decision Making · 2025-12-31

## TL;DR

The paper introduces a reproducible pipeline to standardize and classify unstructured medication data for gerontological research.

## Contribution

A novel, scalable pipeline for harmonizing free-text medication data using deterministic and fuzzy matching techniques.

## Key findings

- A four-phase pipeline successfully standardized 94.2% of medication entries with minimal expert review.
- 444 unique medications were mapped to AHFS classifications, enabling efficient analytical integration.
- The pipeline enhances reproducibility and analytical utility of medication exposure assessments.

## Abstract

Self-reported medication data collected as free text in gerontological and dementia research is often unstructured and inconsistently formatted. This poses a challenge for standardization and classification when preparing effective, reproducible analyses. Spelling variations, inconsistent naming conventions, and the way drug combinations are reported can hinder mapping to standard pharmacologic vocabularies and compromise medication exposure assessments. We aimed to develop and implement a transparent, reproducible, and scalable data harmonization pipeline that ingests free-text medication records and classifies them according to American Hospital Formulary Service (AHFS) therapeutic categories.

A four-phase curation pipeline processed 30,062 Research Electronic Data Capture (REDCap) medication records collected over nearly a decade of annual visits in the Driving Real-world In-Vehicle Evaluation System (DRIVES) Project. In Phase 1, the pipeline standardized medication names using deterministic and fuzzy matching techniques, incorporating Drug-Named-Entity Recognition (DER), the Python library thefuzz, and expert review. Phase 2 mapped drugs to AHFS categories via DrugBank and RxNorm. Phase 3 generated a wide-format dataset with binary class-level exposure indicators. Phase 4 involved a final quality review with auditable documentation.
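To make the Phase 1 idea concrete, here is a minimal sketch of deterministic matching followed by fuzzy matching, with unmatched entries flagged for expert review. It is not the authors' code. It uses the standard-library `difflib` as a stand-in for thefuzz so that it runs without extra dependencies, and the reference list, alias table, `normalize`, `standardize`, and the 0.85 cutoff are all hypothetical choices made for illustration.

```python
import difflib
import re

# Hypothetical reference vocabulary (assumption). The paper maps names
# through DrugBank and RxNorm; this short list is only for illustration.
REFERENCE_NAMES = [
    "amitriptyline", "atorvastatin", "donepezil", "fluoxetine",
    "lisinopril", "rosuvastatin", "simvastatin",
]

# Hypothetical alias table for deterministic lookups of abbreviations.
ALIASES = {"hctz": "hydrochlorothiazide"}


def normalize(raw):
    """Lowercase the entry and drop dose/unit tokens and punctuation."""
    text = raw.lower()
    text = re.sub(r"\d+(\.\d+)?\s*(mg|mcg|g|ml|units?)\b", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def standardize(raw, cutoff=0.85):
    """Return (standard_name, method).

    method is 'exact' for a deterministic hit, 'fuzzy' for an approximate
    match at or above the cutoff, and 'review' when nothing qualifies.
    The cutoff is an assumed value, not one reported in the paper.
    """
    name = normalize(raw)
    if name in ALIASES:
        return ALIASES[name], "exact"
    if name in REFERENCE_NAMES:
        return name, "exact"
    # difflib stands in for thefuzz here; both score string similarity.
    matches = difflib.get_close_matches(name, REFERENCE_NAMES, n=1, cutoff=cutoff)
    if matches:
        return matches[0], "fuzzy"
    return None, "review"


for entry in ["Lisinopril 10mg", "HCTZ", "Amitriptiline 25 mg", "xyzzy"]:
    print(entry, "->", standardize(entry))
```

With thefuzz installed, `thefuzz.process.extractOne` with a `score_cutoff` could plausibly replace the `difflib` call. Keeping the method label alongside each result is one way to make the share of automated versus reviewed entries auditable.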

Out of 30,062 entries, 16,902 eligible prescription entries remained after the removal of vitamins, supplements, and over-the-counter (OTC) drugs. Of these, automated or semi-automated processes successfully standardized 94.2% of entries, with only 5.8% requiring further expert review. A total of 444 unique medications were successfully mapped to AHFS classifications. The curated dataset enables efficient integration into analytical models and supports reproducible assessment of medication exposure.
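The wide-format dataset with binary class-level exposure indicators can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the records, the `to_wide` helper, the column names, and the class labels are hypothetical, and the labels are plain-language placeholders rather than actual AHFS codes.

```python
from collections import defaultdict

# Hypothetical drug-to-class lookup (assumption). Labels are illustrative
# placeholders, not AHFS classification codes.
DRUG_TO_CLASS = {
    "atorvastatin": "statins",
    "simvastatin": "statins",
    "fluoxetine": "antidepressants",
    "donepezil": "cholinesterase_inhibitors",
}

# Hypothetical long-format records: (participant, visit, standardized drug).
long_records = [
    ("P001", 1, "atorvastatin"),
    ("P001", 1, "fluoxetine"),
    ("P001", 2, "simvastatin"),
    ("P002", 1, "donepezil"),
]


def to_wide(records, drug_to_class):
    """Pivot long records into one row per participant-visit.

    Each row carries a 0/1 indicator for every drug class in the lookup.
    """
    classes = sorted(set(drug_to_class.values()))
    exposed = defaultdict(set)
    for pid, visit, drug in records:
        seen = exposed[(pid, visit)]  # ensures the row exists even if unmapped
        drug_class = drug_to_class.get(drug)
        if drug_class is not None:
            seen.add(drug_class)
    rows = []
    for pid, visit in sorted(exposed):
        row = {"participant_id": pid, "visit": visit}
        for drug_class in classes:
            row[drug_class] = int(drug_class in exposed[(pid, visit)])
        rows.append(row)
    return rows


wide = to_wide(long_records, DRUG_TO_CLASS)
for row in wide:
    print(row)
```

One row per participant-visit with 0/1 columns can be passed directly to most regression or survival modeling tools as covariates.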

This pipeline addresses a key methodological challenge in clinical research by providing a reproducible, scalable solution for harmonizing unstructured medication data and enhancing its analytical utility.

The online version contains supplementary material available at 10.1186/s12911-025-03332-w.

## Full-text entities

- **Genes:** NINL (ninein like) [NCBI Gene 22981] {aka NLP}
- **Diseases:** Alzheimer's Disease (MESH:D000544), chronic diseases (MESH:D002908), Memory impairment (MESH:D008569), AHFS (MESH:D003428), REDCap (MESH:D014947), cognitive impairment (MESH:D003072), Dementia (MESH:D003704), ASHP (MESH:C000719191)
- **Chemicals:** Tylenol (MESH:D000082), HCTZ (MESH:D006852), rosuvastatin (MESH:D000068718), amitriptyline (MESH:D000639), fluoxetine (MESH:D005473), lisinopril (MESH:D017706), simvastatin (MESH:D019821), amitriptiline (-), Atorvastatin (MESH:D000069059), lipid (MESH:D008055), donepezil (MESH:D000077265)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12865965/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12865965/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12865965/full.md

---
Source: https://tomesphere.com/paper/PMC12865965