OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

Yang Gao; Ji Ma; Ivan Korotkov; Keith Hall; Dana Alon; Don Metzler

arXiv:2309.10539·cs.CL·September 20, 2023

TL;DR

This paper introduces OpenMSD, a large multilingual scientific documents dataset, and develops models that measure document similarity across languages, making it easier to find related scientific papers written in different languages.

Contribution

The work presents the first multilingual scientific documents dataset and proposes models that use citation-derived paper pairs and English-summary enrichment of non-English papers to improve cross-lingual similarity measurement.

Findings

01. Models outperform baselines by 7-16% in mean average precision.

02. Enriching non-English papers with English summaries improves model performance.

03. OpenMSD contains 74 million papers in 103 languages with 778 million citation pairs.
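
For readers unfamiliar with the mean average precision (MAP) metric cited in the findings above, the following is a minimal sketch of the standard textbook definition. The function names, query identifiers, and toy rankings are hypothetical and chosen only for illustration; the paper's actual evaluation protocol (for example, any rank cutoff) is not described on this page and may differ.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision@k over the
    ranks k at which a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of per-query average precision.

    rankings:  dict mapping query id -> ranked list of candidate ids
    relevance: dict mapping query id -> collection of relevant ids
    """
    scores = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical toy data: two query papers with ranked candidate papers.
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
toy_relevance = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(toy_rankings, toy_relevance)
```

In this toy case the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.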

Abstract

We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related works in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized language models, and explore different strategies to derive "related" paper pairs to fine-tune the models, including using a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve the models' performance for non-English papers, we explore the use of generative language models to enrich the non-English papers with English summaries. This allows us to leverage the models' English…
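
The three kinds of "related" paper pairs named in the abstract (citation, co-citation, and bibliographic coupling) can be illustrated on a toy citation graph. This is a minimal sketch under the usual bibliometric definitions, not the authors' pipeline: the function name `derive_pairs`, the paper ids, and the edge list are all hypothetical, and how the paper samples or mixes these pairs for fine-tuning is not specified on this page.

```python
from collections import defaultdict
from itertools import combinations


def derive_pairs(citations):
    """Derive three kinds of undirected paper pairs from citation edges.

    citations: list of (citing_id, cited_id) tuples.
    Returns (direct, co_citation, coupling) as sets of frozenset pairs.
    """
    cites = defaultdict(set)     # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)

    # Direct citation: one paper cites the other.
    direct = {frozenset((a, b)) for a, b in citations if a != b}

    # Co-citation: two papers appear together in some paper's reference list.
    co_citation = set()
    for refs in cites.values():
        for a, b in combinations(sorted(refs), 2):
            co_citation.add(frozenset((a, b)))

    # Bibliographic coupling: two papers share at least one reference.
    coupling = set()
    for citers in cited_by.values():
        for a, b in combinations(sorted(citers), 2):
            coupling.add(frozenset((a, b)))

    return direct, co_citation, coupling


# Hypothetical toy graph: A cites C and D; B cites C.
toy_citations = [("A", "C"), ("A", "D"), ("B", "C")]
direct, co_citation, coupling = derive_pairs(toy_citations)
```

Here C and D form a co-citation pair because A cites both, while A and B form a bibliographic-coupling pair because both cite C.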

Code & Models

Repositories

google-research/google-research
tf · Official

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies