Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, and Thi Huong Vu

TL;DR
This paper explores the Web of Science dataset using large language model embeddings to analyze the semantic content of millions of scientific publications and their interconnected graph structure.
Contribution
It introduces a novel approach combining LLM embeddings with graph analysis to map and understand large-scale scientific publication networks.
Findings
Revealed a self-structured landscape of scientific texts
Demonstrated the effectiveness of LLM embeddings in large datasets
Provided insights into the relationships between publications
Abstract
Large text data sets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, with its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, which contains ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
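The two feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the vectors below are hand-made stand-ins for LLM embeddings, the edge list is a hypothetical citation graph, and the function names are invented for this example. It only checks whether publications that are linked in the graph are also closer in embedding space than unlinked ones.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def linked_vs_unlinked_similarity(embeddings: np.ndarray, edges):
    """Mean cosine similarity of linked pairs and of unlinked pairs.

    `edges` is treated as an undirected list of (i, j) index pairs.
    """
    n = embeddings.shape[0]
    sim = cosine_similarity_matrix(embeddings)

    # Symmetric boolean adjacency matrix built from the edge list.
    linked = np.zeros((n, n), dtype=bool)
    for i, j in edges:
        linked[i, j] = True
        linked[j, i] = True

    # Count each unordered pair once and skip the diagonal.
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    linked_mean = float(sim[linked & upper].mean())
    unlinked_mean = float(sim[~linked & upper].mean())
    return linked_mean, unlinked_mean


# Toy data (assumption): four "publications" in a 3-dimensional embedding space.
embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])
# Hypothetical citation links: 0 <-> 1 and 2 <-> 3.
edges = [(0, 1), (2, 3)]

linked_mean, unlinked_mean = linked_vs_unlinked_similarity(embeddings, edges)
print(f"linked: {linked_mean:.3f}, unlinked: {unlinked_mean:.3f}")
```

On real data the embeddings would come from an embedding model and the edges from the citation records. A dense similarity matrix is only feasible for small samples, so a dataset of this size would need sampling or an approximate nearest-neighbor index instead.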
Taxonomy
Topics: Advanced Graph Neural Networks · Text and Document Classification Technologies · Advanced Text Analysis Techniques
