Improving average ranking precision in user searches for biomedical research datasets

Douglas Teodoro; Luc Mottin; Julien Gobeill; Arnaud Gaudinat; Thérèse Vachon; Patrick Ruch

arXiv:1709.03061·cs.IR·September 12, 2017

TL;DR

This paper presents a novel ranking pipeline for biomedical dataset search that combines query expansion, relevance-aware similarity measures, and dataset categorization, achieving the highest infAP in the 2016 bioCADDIE Dataset Retrieval Challenge.

Contribution

The authors introduce a new ranking system integrating word embedding-based query expansion, relevance-sensitive similarity measures, and dataset categorization to improve biomedical dataset retrieval.
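
As a rough illustration of embedding-based query expansion, the sketch below adds to each query term its nearest neighbours in a vector space. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy vectors, the `expand_query` helper, and the `k` and `min_sim` parameters are all hypothetical and chosen only to make the idea concrete.

```python
import math

# Toy 3-dimensional vectors, invented for illustration (not from the paper).
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "mouse": [0.0, 0.9, 0.1],
    "mice": [0.05, 0.88, 0.12],
    "protein": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2, min_sim=0.95):
    """Append up to k neighbours per term whose similarity is >= min_sim.

    Hypothetical helper: k and min_sim are assumed tuning knobs.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are kept but not expanded
        neighbours = sorted(
            (
                (cosine(embeddings[term], vec), word)
                for word, vec in embeddings.items()
                if word != term and word not in expanded
            ),
            reverse=True,
        )
        expanded.extend(word for sim, word in neighbours[:k] if sim >= min_sim)
    return expanded


print(expand_query(["cancer", "mouse"], TOY_EMBEDDINGS))
```

In practice the vectors would come from a model trained on a relevant corpus, and the similarity threshold guards against adding loosely related terms that could hurt precision.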

Findings

01

Achieved the highest infAP (inferred average precision) in the bioCADDIE challenge, +22.3% above the participant median.

02

Query expansion improved infAP by up to +5.0% over the baseline.

03

Similarity measure demonstrated robustness across training conditions.

Abstract

The availability of research datasets is a keystone for reproducibility in health and life science studies and for scientific progress. Due to the heterogeneity and complexity of these data, a main challenge for research data management systems is to provide users with the best answers to their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the…
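
The categorisation step described in the abstract, which boosts datasets matching query constraints, could be sketched as a simple re-ranking pass. This is an assumption-laden illustration: the multiplicative `boost` factor, the `boost_by_category` helper, and the category labels are hypothetical, and the abstract does not state how the boost is actually computed.

```python
def boost_by_category(ranked, query_categories, boost=1.5):
    """Re-rank results, multiplying the score of category matches by `boost`.

    ranked: list of (dataset_id, score, category) tuples.
    query_categories: set of categories inferred from the query (assumed given).
    The multiplicative boost is an illustrative choice, not the paper's formula.
    """
    rescored = [
        (doc_id, score * boost if category in query_categories else score, category)
        for doc_id, score, category in ranked
    ]
    # Sort by the adjusted score, highest first.
    return sorted(rescored, key=lambda row: row[1], reverse=True)


initial = [
    ("d1", 1.0, "clinical"),
    ("d2", 0.8, "gene_expression"),
    ("d3", 0.5, "gene_expression"),
]
reranked = boost_by_category(initial, {"gene_expression"})
print([doc_id for doc_id, _, _ in reranked])
```

Here `d2` overtakes `d1` because its category matches the query constraint, while `d3` is boosted but still scores below `d1`.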
