# Standardizing free-text data exemplified by two fields from the Immune Epitope Database

**Authors:** Sebastian Duesing, Jason Bennett, James A. Overton, Randi Vita, Bjoern Peters

PMC · DOI: 10.1186/s13326-025-00324-7 · Journal of Biomedical Semantics · 2025-03-22

## TL;DR

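This paper introduces a rule-based tool for normalizing free-text fields in biomedical databases, demonstrated on the "age" and "data-location" fields of the Immune Epitope Database (IEDB), to improve searchability and enable linkage to ontologies.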

## Contribution

The paper introduces and evaluates a generalizable tool and method for free-text normalization in biomedical databases.

## Key findings

- Output validity ranged from 83.81% to 99.99% across the character, word, and phrase normalization stages for two IEDB fields; the lowest value was the phrase stage of the age field.
- Standardization significantly reduced data variance, enhancing findability and enabling ontology linkages.
- Rules for normalization required one-time development effort and can be reused for ongoing curation.

## Abstract

While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different fields curated from the literature in the Immune Epitope Database (IEDB): “age” and “data-location” (the part of a paper in which data was found).

Free-text entries for the database fields for subject age (4,095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.
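The three-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the rule lists, the `normalize` and `output_validity` helpers, and the `VALID` pattern are all hypothetical, and the sketch assumes that "output validity" means the share of normalized values that match an expected pattern. It also measures validity once at the end, whereas the paper reports it per stage.

```python
import re

# Hypothetical rules for illustration only; the paper's actual rule sets
# (e.g. 21 character, 94 word, 16 phrase rules for "age") are not reproduced.
CHAR_RULES = [
    (r"[\u2010-\u2015]", "-"),  # unify Unicode dashes to an ASCII hyphen
    (r"\s+", " "),              # collapse runs of whitespace
]
WORD_RULES = [
    (r"\b(yrs?|years?)\b", "year"),
    (r"\b(wks?|weeks?)\b", "week"),
    (r"\b(mos?|months?)\b", "month"),
]
PHRASE_RULES = [
    # "6 - 8 week" or "6 to 8 week" -> "6-8 week"
    (r"^(\d+) ?(?:-|to) ?(\d+) (year|month|week)$", r"\1-\2 \3"),
    # "12 year old" -> "12 year"
    (r"^(\d+) (year|month|week)(?: old)?$", r"\1 \2"),
]

# Assumed definition of a valid normalized age value.
VALID = re.compile(r"^\d+(-\d+)? (year|month|week)$")


def apply_rules(text, rules):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def normalize(text):
    """Run character, then word, then phrase normalization."""
    text = apply_rules(text.strip().lower(), CHAR_RULES)
    text = apply_rules(text, WORD_RULES)
    return apply_rules(text, PHRASE_RULES).strip()


def output_validity(values):
    """Fraction of normalized values that match the VALID pattern."""
    normalized = [normalize(v) for v in values]
    return sum(bool(VALID.match(n)) for n in normalized) / len(normalized)
```

Ordering the stages from characters to words to phrases means each later rule set can assume cleaner input, so phrase rules only need to handle one spelling of each unit.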

We developed a generalizable approach for normalizing free text as found in database fields with content on a specific topic. Creating and testing the rules was a one-time effort for a given field, and the rules can now be applied to data as it is being curated. The standardization achieved in the two datasets tested significantly reduces variance in the content, which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.
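One simple way to see the reduction in variance is to count distinct values before and after normalization. The sketch below assumes that proxy; the `distinct_counts` helper and the toy `simple_normalize` function are hypothetical stand-ins, not the measure or the rules used in the paper.

```python
def distinct_counts(values, normalize):
    """Return (distinct raw values, distinct normalized values)."""
    before = len(set(values))
    after = len({normalize(v) for v in values})
    return before, after


def simple_normalize(text):
    """Toy normalizer: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


# Made-up data-location strings for illustration.
raw = ["Table 1", "table  1", "TABLE 1", "Figure 2"]
before, after = distinct_counts(raw, simple_normalize)
```

Fewer distinct values after normalization means a single query term reaches more of the records that refer to the same thing.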

## Full-text entities

- **Genes:** HLA-A (major histocompatibility complex, class I, A) [NCBI Gene 3105] {aka HLAA}, WDTC1 (WD and tetratricopeptide repeats 1) [NCBI Gene 23038] {aka ADP, DCAF9}
- **Diseases:** infectious and immune-mediated diseases (MESH:D003141), ICD-9-CM (OMIM:252500), LLMs (MESH:D007806)
- **Chemicals:** acetaminophen (MESH:D000082)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606], Bos taurus (bovine, species) [taxon 9913]
- **Cell lines:** /6 — Homo sapiens (Human), Tongue squamous cell carcinoma, Cancer cell line (CVCL_5985)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11929277/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11929277/full.md

## References

4 references — full list in the complete paper: https://tomesphere.com/paper/PMC11929277/full.md

---
Source: https://tomesphere.com/paper/PMC11929277