# Optimizing Word Embeddings for Patient Portal Message Datasets with a Small Number of Samples

**Authors:** Qingyuan Song, Congning Ni, Jeremy L. Warner, Qingxia Chen, Lijun Song, S. Trent Rosenbloom, Bradley A. Malin, Zhijun Yin

PMC · DOI: 10.21203/rs.3.rs-4350387/v1 · Research Square · 2024-05-15

## TL;DR

This paper introduces a new method to improve word embeddings for patient portal messages when only a small number of samples are available.

## Contribution

The paper proposes PK-word2vec, a novel adaptation of word2vec that incorporates prior knowledge to enhance performance on small-scale patient messages.

## Key findings

- In over 90% of evaluation tasks, reviewers judged the similar words generated by PK-word2vec to be more relevant than those generated by standard word2vec.
- The performance difference between AMT workers and medical students was negligible, indicating consistent evaluation results.
- PK-word2vec effectively learns from a small dataset of 137,554 patient portal messages.

## Abstract

Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small.

We introduce a novel adaptation of the word2vec model, PK-word2vec, for small-scale messages.

PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec to patient portal messages in the Vanderbilt University Medical Center electronic health record system sent by patients diagnosed with breast cancer from December 2004 to November 2017. We evaluated the model through a set of 1,000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks.
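To make the idea of "most similar terms as prior knowledge" concrete, the sketch below ranks neighbors by cosine similarity in a pre-trained embedding table. It is only an illustration under stated assumptions: the function names (`cosine`, `top_k_similar`), the toy three-dimensional vectors, and the choice of cosine similarity are hypothetical and are not taken from the paper, which does not describe its implementation in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(word, embeddings, k=5):
    """Return the k words closest to `word`, excluding the word itself."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Toy, made-up vectors standing in for a pre-trained embedding model.
pretrained = {
    "chemo": [0.9, 0.1, 0.0],
    "chemotherapy": [0.8, 0.2, 0.0],
    "radiation": [0.6, 0.4, 0.1],
    "appointment": [0.0, 0.1, 0.9],
    "schedule": [0.1, 0.0, 0.8],
}

# Hypothetical "prior knowledge" table: each word mapped to its nearest neighbors.
prior_knowledge = {w: top_k_similar(w, pretrained, k=2) for w in pretrained}
print(prior_knowledge["chemo"])
```

In a setting like the one described, such a neighbor table would be built once from each pre-trained model and then consulted during training on the small corpus; how PK-word2vec actually uses it is covered in the full paper.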

The dataset was composed of 1,389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7,981 non-medical and 1,116 medical words. In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference between the evaluations by AMT workers and medical students was negligible across all comparisons of task choices (p = 0.774 under a paired t-test).
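For readers unfamiliar with the paired t-test mentioned above, the following minimal sketch computes its test statistic from matched observations. The numbers are invented purely for illustration and are not the study's data; the function name `paired_t_statistic` and the per-task agreement rates are assumptions.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical rates at which each reviewer group preferred one model on five tasks.
amt_workers = [0.92, 0.88, 0.95, 0.90, 0.91]
medical_students = [0.90, 0.89, 0.94, 0.92, 0.90]

t_value, dof = paired_t_statistic(amt_workers, medical_students)
print(f"t = {t_value:.3f}, df = {dof}")
```

A t value near zero, as in this toy case, corresponds to a large p-value, which is the pattern behind a "negligible difference" conclusion. Converting t to a p-value requires the Student t distribution (for example via a statistics library), which is left out here to keep the sketch dependency-free.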

PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** breast cancer (MESH:D001943)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11118712/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11118712/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC11118712/full.md

---
Source: https://tomesphere.com/paper/PMC11118712