Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, and Thi Huong Vu

arXiv:2602.04630·cs.CL·February 5, 2026

Open Access

TL;DR

The paper examines the Web of Science dataset, roughly 56 million scientific publications, using LLM embeddings to analyze the semantic content of the texts alongside the graph structure that connects them.

Contribution

The paper introduces an approach that combines LLM embeddings with graph analysis to map large-scale scientific publication networks.

Findings

01 Revealed a self-structured landscape of scientific texts

02 Demonstrated the effectiveness of LLM embeddings in large datasets

03 Provided insights into the relationships between publications

Abstract

Large text data sets, such as publications, websites, and other text-based media, have two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
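
The two feature types named in the abstract can be illustrated with a small Python sketch. It is not the authors' embedding method: `toy_embed` is a hypothetical hashed bag-of-words stand-in for an LLM embedding model, and the four papers and their citation links are invented for illustration. The sketch compares the mean cosine similarity of linked and unlinked pairs of texts, one simple way to relate the semantic feature to the graph feature.

```python
import hashlib
import math
from itertools import combinations

DIM = 256  # toy embedding size (assumption); real LLM embeddings differ


def toy_embed(text):
    """Hypothetical stand-in for an LLM embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # Map each token to a fixed bucket via a stable hash
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))


# Feature type 1: the texts themselves (invented mini-corpus)
papers = {
    "p1": "graph neural networks for citation prediction",
    "p2": "citation prediction with graph neural networks and attention",
    "p3": "protein folding with molecular dynamics simulation",
    "p4": "molecular dynamics simulation of protein folding pathways",
}

# Feature type 2: relationships between texts (invented, undirected citation links)
edges = {frozenset(("p1", "p2")), frozenset(("p3", "p4"))}

embeddings = {pid: toy_embed(text) for pid, text in papers.items()}


def mean_similarity(pairs):
    """Average cosine similarity over a list of (id, id) pairs."""
    pairs = list(pairs)
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


all_pairs = [frozenset(p) for p in combinations(papers, 2)]
linked = [tuple(p) for p in all_pairs if p in edges]
unlinked = [tuple(p) for p in all_pairs if p not in edges]

print("linked pairs:  ", round(mean_similarity(linked), 3))
print("unlinked pairs:", round(mean_similarity(unlinked), 3))
```

In this toy corpus the linked pairs share most of their vocabulary, so they score higher than the unlinked ones. With a real embedding model, `toy_embed` would be replaced by a call to that model and the rest of the comparison would stay the same.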

Taxonomy

Topics: Advanced Graph Neural Networks · Text and Document Classification Technologies · Advanced Text Analysis Techniques