A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art
Alicia Lara-Clares, Juan J. Lastra-Díaz, Ana Garcia-Serrano

TL;DR
This paper presents a comprehensive, reproducible survey of biomedical sentence similarity methods, introducing a new string-based approach that outperforms existing models and emphasizing the importance of preprocessing and NER tools.
Contribution
It introduces LiBlock, a novel string-based similarity method, and provides a reproducible experimental framework for evaluating biomedical sentence similarity techniques.
Findings
LiBlock sets new state-of-the-art performance.
Preprocessing and NER tools significantly affect results.
String-based methods outperform most machine learning models.
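To make "string-based" concrete, here is a minimal sketch of a generic token-level block-distance (L1) similarity, one classic measure in this family. It is not LiBlock, whose definition is not given in this summary. The function name, the whitespace tokenization, and the lowercasing step are assumptions made for illustration only.

```python
from collections import Counter


def block_distance_similarity(a: str, b: str) -> float:
    """Illustrative token-level block (L1) similarity in [0, 1].

    Hypothetical helper, NOT the paper's LiBlock method.
    """
    # Assumed pre-processing: lowercase and split on whitespace.
    tokens_a = Counter(a.lower().split())
    tokens_b = Counter(b.lower().split())

    total = sum(tokens_a.values()) + sum(tokens_b.values())
    if total == 0:
        # Two empty sentences are treated as identical.
        return 1.0

    # L1 distance between the two token-count vectors.
    vocabulary = tokens_a.keys() | tokens_b.keys()
    distance = sum(abs(tokens_a[t] - tokens_b[t]) for t in vocabulary)

    # Normalize by the total token count so the score lies in [0, 1].
    return 1.0 - distance / total
```

Because the score depends only on the tokens that survive pre-processing, changing the tokenizer or the casing rule changes the result, which is consistent with the finding above that pre-processing choices matter.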
Abstract
This registered report introduces the largest and, for the first time, a reproducible experimental survey on biomedical sentence similarity, with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems that prevent the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate an unexplored benchmark, called Corpus-Transcriptional-Regulation; (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to address the lack of reproducibility resources for methods and experiments in this line of research. Our experimental survey is based on a single software platform that is provided with a detailed reproducibility protocol and dataset as supplementary…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
