OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement
Yang Gao, Ji Ma, Ivan Korotkov, Keith Hall, Dana Alon, Don Metzler

TL;DR
This paper introduces OpenMSD, a large multilingual dataset of scientific documents, and develops models that measure document similarity across languages, so that researchers can find related papers written in languages other than their own.
Contribution
The work presents the first multilingual scientific documents dataset and proposes models that leverage citation data and language enrichment to improve cross-lingual similarity measurement.
Findings
The proposed models outperform baselines by 7-16% in mean average precision.
Enriching non-English papers with English summaries improves model performance.
OpenMSD contains 74 million papers in 103 languages with 778 million citation pairs.
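Mean average precision, the metric behind the 7-16% figure, can be illustrated with a minimal sketch. The function names and the toy data are hypothetical, and the paper's exact evaluation protocol (ranking cutoff, how relevant papers are chosen) is not given in this summary, so this shows only the standard textbook definition.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant paper appears.
            precision_sum += hits / rank
    # Relevant papers that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of per-query average precision.

    rankings:  dict mapping query id -> ranked list of candidate paper ids
    relevance: dict mapping query id -> iterable of relevant paper ids
    """
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, relevance.get(query, ()))
        for query, ranked in rankings.items()
    )
    return total / len(rankings)


# Toy example with made-up paper ids.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
relevance = {"q1": {"a", "c"}, "q2": {"y"}}
score = mean_average_precision(rankings, relevance)
```

In the toy example, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.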
Abstract
We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related works in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized language models, and explore different strategies to derive "related" paper pairs to fine-tune the models, including using a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve the models' performance for non-English papers, we explore the use of generative language models to enrich the non-English papers with English summaries. This allows us to leverage the models' English…
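The three kinds of "related" pairs named in the abstract can be derived from a plain list of citation edges. The following is a minimal sketch under the usual bibliometric definitions: two papers are co-cited when some third paper cites both, and bibliographically coupled when both cite the same paper. The function name `derive_pairs` and the toy ids are hypothetical; how the authors actually sample or mix these pairs is not described in this summary.

```python
from collections import defaultdict
from itertools import combinations


def derive_pairs(citations):
    """Derive unordered related-paper pairs from (citing_id, cited_id) edges.

    Returns three sets of frozenset pairs:
    direct citation, co-citation, and bibliographic coupling.
    """
    edges = [(src, dst) for src, dst in citations if src != dst]

    cites = defaultdict(set)     # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for src, dst in edges:
        cites[src].add(dst)
        cited_by[dst].add(src)

    # Direct citation: the citing paper and the cited paper.
    direct = {frozenset(edge) for edge in edges}

    # Co-citation: two papers that appear together in one reference list.
    co_citation = set()
    for refs in cites.values():
        for a, b in combinations(sorted(refs), 2):
            co_citation.add(frozenset((a, b)))

    # Bibliographic coupling: two papers that cite the same paper.
    coupling = set()
    for citers in cited_by.values():
        for a, b in combinations(sorted(citers), 2):
            coupling.add(frozenset((a, b)))

    return direct, co_citation, coupling


# Toy graph: A cites C and D, B cites C.
direct, co_citation, coupling = derive_pairs([("A", "C"), ("A", "D"), ("B", "C")])
```

Here C and D form a co-citation pair (both cited by A), and A and B form a coupling pair (both cite C). Enumerating all pairs is quadratic in the length of each reference list, so at the scale of 778M citation pairs some form of sampling would presumably be needed.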
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies
