Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study
Sunghoon Kang, Hyewon Park, Ricky Taira, Hyeoneui Kim

TL;DR
This study introduces a new algorithm for identifying similar health survey questions across languages, improving the standardization of health data.
Contribution
The SBERT-LaBSE algorithm is proposed for cross-lingual semantic similarity detection in health surveys.
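As a rough illustration of embedding-based similarity detection, the sketch below scores question pairs by cosine similarity and flags those above a cutoff. It is not the authors' implementation: the three-dimensional vectors, the question IDs, the `redundant_pairs` helper, and the 0.9 threshold are all hypothetical. In practice the vectors would come from a multilingual sentence encoder such as LaBSE.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def redundant_pairs(embeddings, threshold):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = cosine_similarity(embeddings[a], embeddings[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Toy vectors standing in for real sentence embeddings (hypothetical values).
toy_embeddings = {
    "en_q1": [0.9, 0.1, 0.0],  # an English question
    "ko_q1": [0.8, 0.2, 0.1],  # a Korean question with similar meaning
    "en_q2": [0.0, 0.1, 0.9],  # an unrelated English question
}

# The threshold is an illustrative assumption, not a value from the study.
flagged = redundant_pairs(toy_embeddings, threshold=0.9)
```

With these toy values only the `en_q1`/`ko_q1` pair is flagged. A real cutoff would need to be tuned against annotated question pairs.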
Findings
SBERT-LaBSE outperformed other models in cross-lingual semantic similarity assessment with high AUC scores.
The algorithm effectively aligns semantically equivalent sentences but struggles to capture subtle nuances and to remain computationally efficient.
Future improvements include testing with larger datasets and score normalization across domains.
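For readers unfamiliar with the AUC metric mentioned in the findings, the sketch below computes it as the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair. The labels and scores are invented for illustration and do not come from the study. The `roc_auc` helper is a hypothetical name.

```python
def roc_auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = annotated as semantically similar, 0 = dissimilar.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.40, 0.85, 0.70, 0.10]
auc = roc_auc(labels, scores)
```

This pairwise form is quadratic in the number of examples, so it suits small evaluation sets. A rank-based library routine would be the usual choice for larger ones.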
Abstract
As the importance of person-generated health data (PGHD) in health care and research has increased, efforts have been made to standardize survey-based PGHD to improve its usability and interoperability. Standardization efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS) and the National Institutes of Health (NIH) Common Data Elements (CDE) repository provide effective tools for managing and unifying health survey questions. However, previous methods using ontology-mediated annotation are not only labor-intensive and difficult to scale but also challenging for identifying semantic redundancies in survey questions, especially across multiple languages. The goal of this work was to compute the semantic similarity among publicly available health survey questions to facilitate the standardization of survey-based PGHD. We compiled various health survey…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Interpreting and Communication in Healthcare
