Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)
Sunghoon Kang, Hyeoneui Kim, Hyewon Park, Ricky Taira

TL;DR
This study introduces a language-agnostic BERT sentence embedding method to detect redundant health survey questions across English and Korean, improving semantic similarity assessment and cross-lingual interoperability.
Contribution
The paper presents SBERT-LaBSE, a novel multilingual semantic similarity model that outperforms existing methods in identifying similar health survey questions across languages.
Findings
SBERT-LaBSE achieved over 0.99 AUC in similarity detection.
It effectively identified cross-lingual semantic similarities.
The model outperformed Bag-of-Words and other BERT-based models.
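The AUC figure above can be read as the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = experts judged the pair similar, 0 = dissimilar.
labels = [1, 1, 0, 0, 1, 0]
# Hypothetical model similarity scores for the same six pairs.
scores = [0.92, 0.81, 0.40, 0.85, 0.77, 0.10]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC near 1.0 means almost every similar pair is ranked above almost every dissimilar pair, independent of any particular decision threshold.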
Abstract
The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based Person-Generated Health Data (PGHD). We compiled various health survey questions authored in both English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications. Questions were drawn from various health lifelog domains. A randomized question pairing scheme was used to generate a Semantic Text Similarity (STS) dataset consisting of 1758 question pairs. Similarity scores for each question pair were assigned by two human experts. The tagged dataset was then used to build three classifiers featuring: Bag-of-Words, SBERT with BERT-based embeddings, and SBERT with LaBSE embeddings. The algorithms were evaluated using traditional contingency statistics. Among the three…
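To make the Bag-of-Words baseline mentioned in the abstract concrete, here is a minimal sketch of scoring a question pair by the cosine similarity of word-count vectors. It is an illustration under stated assumptions, not the authors' implementation: the regex tokenizer, the helper names `bow_vector` and `cosine_similarity`, and the three example questions are all hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercase the text and count word tokens (assumed tokenizer: \\w+)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty question shares nothing with any other
    return dot / (norm_a * norm_b)


# Hypothetical survey questions, not taken from the study's dataset.
q1 = "How many hours did you sleep last night?"
q2 = "How many hours of sleep did you get last night?"
q3 = "Do you currently smoke cigarettes?"

sim_close = cosine_similarity(bow_vector(q1), bow_vector(q2))
sim_far = cosine_similarity(bow_vector(q1), bow_vector(q3))
print(f"paraphrase pair: {sim_close:.3f}, unrelated pair: {sim_far:.3f}")
```

A representation like this depends entirely on shared surface tokens, so an English question and its Korean translation would typically score near zero. That limitation is one motivation for comparing it against embedding-based models such as the SBERT variants described above.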
Taxonomy
Topics: Topic Modeling
Methods: Sentence-BERT (focus)
