Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study

Sunghoon Kang; Hyewon Park; Ricky Taira; Hyeoneui Kim

PMC · DOI: 10.2196/71687 · June 10, 2025

Open Access

TL;DR

This study introduces a new algorithm for identifying similar health survey questions across languages, improving the standardization of health data.

Contribution

The study proposes SBERT-LaBSE, an algorithm for detecting cross-lingual semantic similarity among health survey questions.
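As a rough illustration of what embedding-based redundancy detection can look like, the sketch below scores every pair of questions by cosine similarity and flags pairs above a cutoff. It is not the authors' implementation: the three-dimensional vectors are hand-made stand-ins for real sentence embeddings, and the question IDs, the `find_redundant_pairs` helper, and the 0.9 threshold are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def find_redundant_pairs(embeddings, threshold):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold.

    `embeddings` maps a question ID to its vector. Results are sorted from
    most to least similar.
    """
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = cosine_similarity(embeddings[a], embeddings[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy vectors (assumption): stand-ins for embeddings of an English sleep
# question, its Korean counterpart, and an unrelated smoking question.
toy_embeddings = {
    "en_sleep": [0.9, 0.1, 0.0],
    "ko_sleep": [0.8, 0.2, 0.1],
    "en_smoking": [0.0, 0.1, 0.9],
}

pairs = find_redundant_pairs(toy_embeddings, threshold=0.9)
for a, b, score in pairs:
    print(f"{a} ~ {b}: {score:.3f}")
```

With these toy vectors only the two sleep questions are flagged. The all-pairs loop grows quadratically with the number of questions, which is one plausible reason efficiency becomes a concern for this kind of approach at scale.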

Findings

01

SBERT-LaBSE outperformed other models in cross-lingual semantic similarity assessment with high AUC scores.

02

The algorithm effectively aligns semantically equivalent sentences but struggles to capture subtle nuances and has efficiency limitations.

03

Future improvements include testing with larger datasets and score normalization across domains.
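The AUC cited in the first finding can be read as the probability that a randomly chosen truly similar pair receives a higher similarity score than a randomly chosen dissimilar pair. The sketch below computes it directly from that pairwise definition. The labels and scores are invented for illustration, and the summary above does not say how the authors computed their AUC values, so this should be taken as the standard textbook definition rather than their procedure.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for a truly similar pair and 0 otherwise; `scores`
    holds the model's similarity score for each pair. Ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented example data (assumption): six question pairs with gold labels
# and model similarity scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.95, 0.80, 0.60, 0.85, 0.70, 0.30]

auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {auc:.3f}")
```

Here 7 of the 9 positive-negative comparisons are ranked correctly, giving an AUC of about 0.778. Because AUC depends only on the ranking of scores, it does not require choosing a similarity threshold in advance.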

Abstract

As the importance of person-generated health data (PGHD) in health care and research has increased, efforts have been made to standardize survey-based PGHD to improve its usability and interoperability. Standardization efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS) and the National Institutes of Health (NIH) Common Data Elements (CDE) repository provide effective tools for managing and unifying health survey questions. However, previous methods using ontology-mediated annotation are not only labor-intensive and difficult to scale but also challenging for identifying semantic redundancies in survey questions, especially across multiple languages. The goal of this work was to compute the semantic similarity among publicly available health survey questions to facilitate the standardization of survey-based PGHD. We compiled various health survey…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (1)

Homo sapiens (human · species)

Figures (4)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Interpreting and Communication in Healthcare