Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study

Tanay Aggarwal; Angelo Salatino; Francesco Osborne; Enrico Motta

arXiv:2508.20693·cs.DL·August 29, 2025

Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study

Tanay Aggarwal, Angelo Salatino, Francesco Osborne, Enrico Motta

PDF

Open Access

TL;DR

This study explores how large language models can be used to automatically generate research topic ontologies across multiple disciplines, aiming to improve efficiency and coverage in scientific knowledge organization.

Contribution

The paper introduces PEM-Rel-8K, a new dataset for research relationships, and evaluates LLMs' ability to generate research ontologies across disciplines with fine-tuning and prompting methods.

Findings

01

Fine-tuning LLMs on PEM-Rel-8K achieves high accuracy.

02

Models perform well across disciplines after fine-tuning.

03

Cross-domain transferability of models is promising.

Abstract

Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination and retrieval of information. However, the creation and maintenance of such ontologies are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across different disciplines, limited inter-domain connectivity, and infrequent updating cycles. In this study, we investigate the capability of several large language models to identify semantic relationships among research topics within three academic domains: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we…

Tables4

Table 1. Table 1 . Sample distributions of PEM-Rel-8K .

	Train	Validation	Test	Total	Percentage
IEEE-Rel-3K	2,240	320	640	3,200	39.6 %
PhySH-Rel-87	613	87	175	875	10.8 %
MeSH-Rel-4K	2,800	400	800	4,000	49.6 %
PEM-Rel-8K	5,653	807	1,615	8,075

Table 2. Table 2 . Overview of the 12 LLMs used in our experiments. The table includes the Model name, the alias adopted in this paper, the number of trainable Parameters , the context Window size, and the rank and scaling factor of the low-rank adaptation matrices used in LoRA ( r and alpha ).

Model	Alias	Parameters	Window	r	alpha
mistral-7b-instruct-v0.3	mistral-7b	7.25B	32K	16	16
Mistral-Nemo-Instruct-2407	mistral-nemo-12b	12.2B	128K	16	16
Mistral-Small-Instruct-2409	mistral-22b	22.2B	128K	16	16
Llama-3.2-3B-Instruct	llama-3b	3.21B	128K	256	128
llama-2-7b-chat	llama-chat-7b	7B	4K	256	128
Meta-Llama-3.1-8B-Instruct	llama-8b	8.03B	128K	256	128
gemma-2b-it	gemma-2b	2.51B	8K	256	128
gemma-2-9b-it	gemma-9b	9.24B	8K	256	128
gemma-2-27b-it	gemma-27b	27.2B	8K	256	128
Phi-3.5-mini-instruct	phi-3	3.82B	128K	256	128
phi-4	phi-4	14.7B	16K	256	128
zephyr-sft	zephyr-7b	7.24B	8K	16	16

Table 3. Table 3 . F1-score, Precision, and Recall for the best-performing model in the 24 experimental configurations. The Approach column denotes the methodological strategy employed: STD refers to Standard Prompting, bCoT to bidirectional Chain-of-Thought, and FT to Fine-Tuning. The Training Set and Test Set columns specify the datasets used for model training and evaluation, respectively, while the Model column identifies the top-performing model in each configuration.

Approach	Training Set	Test Set	F1-Score	Precision	Recall	Model
FT	IEEE-Rel-3K	IEEE-Rel-3K	\ul0.989	0.989	0.989	gemma-27b
FT	MeSH-Rel-4K	IEEE-Rel-3K	0.945	0.947	0.945	gemma-27b
FT	PhySH-Rel-875	IEEE-Rel-3K	0.947	0.947	0.947	phi-4
FT	PEM-Rel-8K	IEEE-Rel-3K	\ul0.973	0.974	0.973	gemma-27b
STD	-	IEEE-Rel-3K	0.716	0.830	0.747	mistral-22b
bCoT	-	IEEE-Rel-3K	0.769	0.809	0.787	mistral-7b
FT	IEEE-Rel-3K	MeSH-Rel-4K	0.782	0.834	0.785	mistral-22b
FT	MeSH-Rel-4K	MeSH-Rel-4K	\ul0.917	0.918	0.917	phi-4
FT	PhySH-Rel-875	MeSH-Rel-4K	0.877	0.877	0.877	gemma-27b
FT	PEM-Rel-8K	MeSH-Rel-4K	\ul0.908	0.911	0.907	gemma-27b
STD	-	MeSH-Rel-4K	0.669	0.766	0.694	gemma-9b
bCoT	-	MeSH-Rel-4K	0.716	0.775	0.725	gemma-9b
FT	IEEE-Rel-3K	PhySH-Rel-875	0.842	0.844	0.865	phi-4
FT	MeSH-Rel-4K	PhySH-Rel-875	0.864	0.870	0.865	phi-4
FT	PhySH-Rel-875	PhySH-Rel-875	\ul0.936	0.946	0.930	gemma-9b
FT	PEM-Rel-8K	PhySH-Rel-875	\ul0.925	0.927	0.925	gemma-27b
STD	-	PhySH-Rel-875	0.666	0.786	0.655	mistral-22b
bCoT	-	PhySH-Rel-875	0.719	0.762	0.705	mistral-7b
FT	IEEE-Rel-3K	PEM-Rel-8K	0.861	0.877	0.863	mistral-22b
FT	MeSH-Rel-4K	PEM-Rel-8K	0.919	0.920	0.919	phi-4
FT	PhySH-Rel-875	PEM-Rel-8K	0.906	0.907	0.906	gemma-27b
FT	PEM-Rel-8K	PEM-Rel-8K	\ul0.935	0.935	0.935	gemma-27b
STD	-	PEM-Rel-8K	0.677	0.779	0.703	mistral-22b
bCoT	-	PEM-Rel-8K	0.730	0.762	0.741	mistral-7b

Table 4. Table 4 . Precision, Recall, and F1-score for models fine-tuned on the PEM-Rel-8K training set and evaluated on the PEM-Rel-8K test set. BR refers to performance on broader relations, NA to narrower , OT to other , and SA to same-as . AVG indicates the average performance across the four categories. The best-performing scores for each relation are highlighted in \ul bold & underlined . Due to space constraints, the leading zero has been omitted from all values.

MODEL	F1-SCORE					PRECISION					RECALL
MODEL	AVG	BR	NR	OT	SA	AVG	BR	NR	OT	SA	AVG	BR	NR	OT	SA
mistral-7b	.906	.905	.895	.947	.877	.907	.917	.931	.932	.848	.906	.893	.861	.963	.909
mistral-nemo-12b	.907	.902	.886	.949	.890	.907	.883	.912	.96	.875	.907	.922	.861	.939	.906
mistral-22b	.922	.921	.911	.958	.899	.924	\ul.963	.938	.945	.852	.923	.883	.885	.971	\ul.953
llama-3b	.862	.856	.857	.905	.830	.867	.813	.820	.957	.880	.861	.902	.898	.859	.784
llama-chat-7b	.887	.891	.865	.932	.861	.888	.915	.877	.911	.847	.888	.868	.854	.954	.875
llama-8b	.910	.909	.895	.946	.889	.910	.900	.918	.936	.886	.910	.919	.873	.956	.891
gemma-2b	.892	.896	.868	.932	.872	.895	.888	\ul.943	.921	.829	.893	.905	.805	.944	.919
gemma-9b	.926	.919	.917	.958	.909	.926	.919	.932	.943	\ul.909	.926	.919	.902	\ul.973	.909
gemma-27b	\ul.935	\ul.935	\ul.922	\ul.966	\ul.916	\ul.935	.961	.920	.966	.894	\ul.935	.910	\ul.924	.966	.940
phi-3	.899	.894	.892	.928	.883	.900	.867	.898	.961	.875	.899	.922	.885	.898	.891
phi-4	.918	.913	.914	.950	.895	.919	.900	.925	\ul.970	.880	.918	\ul.927	.902	.932	.912
zephyr-7b	.915	.911	.898	.951	.900	.915	.902	.916	.951	.891	.915	.919	.880	.951	.909
AVG	.906	.904	.893	.943	.885	.907	.902	.910	.946	.872	.906	.907	.877	.942	.899

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsData Quality and Management · Advanced Text Analysis Techniques · Semantic Web and Ontologies

Full text

\useunder

\ul

Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study

Tanay Aggarwal

[email protected]

0009-0009-9477-7112

Knowledge Media Institute, The Open UniversityMilton KeynesUK

,

Angelo Salatino

[email protected]

0000-0002-4763-3943

Knowledge Media Institute, The Open UniversityMilton KeynesUK

,

Francesco Osborne

0000-0001-6557-3131

[email protected]

Knowledge Media Institute, The Open UniversityMilton KeynesUK

Department of Business and Law, University of Milano BicoccaMilanIT

and

Enrico Motta

[email protected]

0000-0003-0015-1952

Knowledge Media Institute, The Open UniversityMilton KeynesUK

Abstract.

Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination and retrieval of information. However, the creation and maintenance of such ontologies are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across different disciplines, limited inter-discipline connectivity, and infrequent updating cycles. In this study, we investigate the capability of several large language models to identify semantic relationships among research topics within three academic disciplines: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we assessed the cross-discipline transferability of fine-tuned models by measuring their performance when trained in one discipline and subsequently applied to a different one. To support this analysis, we introduce PEM-Rel-8K, a novel dataset consisting of over 8,000 relationships extracted from the most widely adopted taxonomies in the three disciplines considered in this study: MeSH, PhySH, and IEEE. Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines.

Ontologies of Research Topics, Large Language Models, Knowledge Organization Systems, KOSs, Ontology Generation, Thesaurus.

††copyright: none

1. Introduction

Ontologies and taxonomies of research fields are essential for structuring scientific knowledge, as they support the effective classification, dissemination, and retrieval of information (Dunne and Hulek, 2020; Lipscomb, 2000; Rous, 2012). These knowledge bases are widely used to describe and categorise research outputs, including scientific articles, research projects, patents, datasets, and software. Moreover, they play a pivotal role in the evaluation and profiling of universities, research organisations, groups, and individual scholars (Rahdari et al., 2021; Wang et al., 2020). In addition, these ontologies are fundamental components of intelligent systems that operate on academic literature (Osborne et al., 2013; Beel et al., 2016; Gusenbauer and Haddaway, 2020), such as search engines (Gusenbauer and Haddaway, 2020), conversational agents (Meloni et al., 2023), analytics dashboards (Angioni et al., 2021), and recommender systems (Beel et al., 2016).

A recent survey identified 45 such knowledge organization systems (KOSs)—including ontologies, taxonomies, and thesauri—spanning most academic disciplines and highlighted several systemic issues within this landscape (Salatino et al., 2025). First, the coverage provided by these KOSs is highly uneven, with some disciplines either entirely absent or only partially represented. Second, these resources are weakly interconnected, which limits their capacity to effectively capture research activities that emerge from interdisciplinary work. Third, many of these KOSs are updated infrequently, failing to reflect recent developments, which often constitute the most dynamic and impactful aspects of scientific progress. These shortcomings result in a limited representation of the scientific landscape, which in turn restricts the dissemination of research outputs and undermines the effectiveness of systems for literature analysis and exploration.

The root cause of many of these issues is that the creation and maintenance of such KOSs are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts over a prolonged period (Osborne and Motta, 2015). In previous years, several approaches have been proposed to automate or semi-automate the ontology generation process (Sanderson and Croft, 1999; Osborne and Motta, 2012, 2015; Han et al., 2020). Despite these advancements, existing methods still encounter significant challenges in producing large-scale and fine-grained representations of research topics, due to the discipline’s inherent complexity and specialisation. Consequently, with a few notable exceptions, such as the Computer Science Ontology (Salatino et al., 2020), most ontologies in this field are still developed predominantly through manual processes.

Over the past three years, Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), enhancing the ability of machines to understand and generate human language. In this context, several studies have explored the use of LLMs for ontology generation, producing highly promising outcomes (Babaei Giglou et al., 2023; Aggarwal et al., 2026; Fathallah et al., 2024a; Lippolis et al., 2025).

In this paper, we investigate the effectiveness of a broad range of LLMs and optimisation techniques in identifying semantic relations between pairs of research topics, a task that is central to the construction of academic ontologies. We focus on three strategies for adapting LLMs to this problem: zero-shot settings (ZSL), Chain-of-Thought (CoT) prompting, and fine-tuning on relationships drawn from existing ontologies. Our evaluation considers 12 open-weight LLMs, with sizes ranging from 3 billion to 27 billion parameters, and examines their ability to automatically detect three types of semantic relations: broader, narrower, and same-as.

To support this analysis, we introduce PEM-Rel-8K, a multi-disciplinary benchmark including over 8,000 relationships drawn from three widely used ontologies: IEEE, which covers engineering and computer science; PhySH, which focuses on physics; and MeSH, the primary knowledge base used to categorise biomedical research. This novel resource also enabled us to explore the effects of fine-tuning a model on one scientific discipline and testing it on others, yielding valuable insights into the cross-discipline transferability of fine-tuning in this context. To the best of our knowledge, this is the first large-scale study to investigate the use of LLMs for identifying semantic relationships between research topics across a wide range of scientific disciplines.

Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance, significantly surpassing alternative approaches such as CoT prompting. Among the evaluated models, the fine-tuned gemma-2-27b-it achieved the highest F1-Score (93.5%), followed by gemma-2-9b-it and Mistral-Small-Instruct-2409. Notably, the best-performing LLM fine-tuned on PEM-Rel-8K achieved an average F1-Score across the three disciplines that was only 1.2% lower than that of models fine-tuned separately on discipline-specific training sets. Furthermore, our analysis of cross-discipline transferability shows that LLMs trained on one discipline can generalise effectively to others, maintaining competitive performance across different disciplines.

In summary, the contributions of this paper are as follows:

•

We present a comprehensive analysis of the capability of twelve LLMs to identify semantic relationships between research topics across three scientific disciplines: engineering (IEEE), physics (PhySH), and biomedicine (MeSH).

•

We study different optimisation techniques, including ZSL, CoT reasoning, and fine-tuning.

•

We fine-tune the models on individual disciplines as well as on a hybrid training set that integrates all of them, enabling the analysis of cross-discipline adaptability.

•

We introduce and publicly release PEM-Rel-8K, a novel modular benchmark designed for training and evaluating models on this task.

•

We provide the complete codebase for our analysis111The datasets and the code for our experiments are available at: https://github.com/ImTanay/LLM-Multi-Domain-Ontology.

The remainder of this paper is organised as follows. Section 2 reviews the relevant literature. Section 3 defines the task, introduces the novel PEM-Rel-8K dataset, and describes the LLMs under analysis. Section 4 outlines the experimental design and implementation setup. Section 5 presents and discusses the results. Finally, Section 6 concludes the paper and suggests directions for future research.

2. Related Work

We begin by examining well-known ontologies of research areas (Section 2.1), and then discuss various approaches for generating such ontologies (Section 2.2).

2.1. Knowledge Organization Systems of Research Areas

KOSs are formal frameworks designed to structure information and enable efficient knowledge management and retrieval (Zeng, 2008; Mazzocchi, 2018). They are widely used to describe research areas and the relationships among them across a variety of repositories and digital libraries (Salatino et al., 2025). Depending on their structural complexity and available features, such as hierarchical depth or synonym control, these systems generally fall into four categories: term lists, taxonomies, thesauri, and ontologies. A term list is a flat, non-hierarchical collection of subject headings or descriptors used to index document sets, without explicit semantic relations among its entries (Hedden, 2010; Zaharee, 2013). A taxonomy introduces structure by organising classes hierarchically through parent-child relationships (Rasch, 1987). It typically follows a tree structure, starting from a root node and branching into progressively more specific subclasses. A thesaurus extends the taxonomic model by incorporating additional descriptive properties, including definitions, associative links, and synonyms (ANSI/NISO Z39.19-2005(R2010), 2010). Finally, an ontology provides a formal and explicit specification of a conceptual domain, classifying entities according to their defining attributes (Gruber, 1993). Ontologies represent the most functionally complete form of KOS, as they describe concepts, entities, and their relations (Genesereth and Nilsson, 2012). They support advanced semantic capabilities, including synonym resolution, property definition, and the representation of diverse relationship types (Zeng, 2008). Within the scholarly domain, these frameworks are essential for the classification and retrieval of research outputs, such as publications and datasets. The objective of this study is to assess the ability of LLMs to identify semantic relations, thereby supporting the construction and refinement of ontologies for the representation of research areas.

In this paper, we focus on three well-known academic KOSs: Medical Subject Headings, the IEEE Thesaurus, and PhySH (Lipscomb, 2000; Rous, 2012; Salatino et al., 2018; OpenAlex, 2024). The Medical Subject Headings (MeSH) is an ontology that describes the field of medicine with over 31K headings (Lipscomb, 2000). It is maintained by the US National Library of Medicine, and it is primarily employed to organise medical publications in the Medline database. The IEEE Thesaurus includes approximately 12,000 curated terms relevant to electrical and electronics engineering. Developed and maintained by the IEEE, it plays a crucial role in classifying research outputs within IEEE’s digital repositories. PhySH (Physics Subject Headings) offers about 3,700 concepts related to physics and is primarily used to index papers in Physical Review and on arXiv (Smith, 2020).

In addition to specialised ontologies, several multidisciplinary ones are available. These include the UNESCO Thesaurus and the ANZSRC Fields of Research (FoR), each comprising approximately 4,400 concepts; EuroVoc, which contains around 7,000; and OpenAlex Topics, which covers about 4,800 subjects.

A recent survey (Salatino et al., 2025) identified 45 ontologies of research topics, highlighting a highly fragmented landscape. Notably, the survey found that no single ontology is openly available, fine-grained, and comprehensive across all research disciplines. This limitation is largely attributed to the reliance on manual curation, which requires significant time, expertise, and financial investment. Consequently, automated methods for generating research topic ontologies have gained increasing attention as a viable alternative. In this work, we contribute to this direction by proposing an approach for automatically identifying semantic relations between pairs of research topics.

2.2. Automatic Generation of Ontologies

Early semi-automatic approaches to ontology generation integrated expert input with linguistic and statistical tools (Maedche and Staab, 2001). With the advancement of NLP and machine learning techniques, these methods evolved into more sophisticated systems, such as Text2Onto (Cimiano and Völker, 2005) and OntoLearn (Navigli et al., 2004), which sought to minimise manual effort while maintaining a degree of human oversight. In recent years, the emergence of deep learning models such as BERT (Devlin et al., 2018) has led to substantial improvements in concept extraction (Grootendorst, 2022) and the modelling of hierarchical relationships (Chen et al., 2020; Pisu et al., 2024).

LLMs have recently demonstrated significant potential in ontology generation. For instance, the LLMs4OL approach has achieved strong performance across a variety of datasets, including lexicosemantic knowledge, geographical information, and medical data (Babaei Giglou et al., 2023). Nonetheless, several studies have revealed persistent challenges related to consistency and accuracy, underscoring the necessity of human oversight during the formalisation process (Saeedizade and Blomqvist, 2024). (Fathallah et al., 2024a) introduced NeOn-GPT, a workflow that applies the NeOn methodology to ontology generation. Building on this work, LLMs4Life (Fathallah et al., 2024b) adapted NeOn-GPT for the development of life science ontologies. Their study emphasised the need for more context-rich prompts and demonstrated that the involvement of domain experts significantly enhances the quality of the results. In a related effort, (Lippolis et al., 2025) assessed the use of LLMs in ontology design, aiming to reduce manual effort and support less experienced engineers. While their results confirmed the utility of LLMs as assistive tools, they also highlighted issues with consistency that require subsequent post-processing. In order to address these challenges, recent studies have proposed a range of human-in-the-loop strategies that investigate different modes of collaboration between LLMs and domain experts (Tsaneva et al., 2025).

Overall, while LLMs show promise for this task, they still face significant limitations in generating complete, high-quality ontologies that can match those created by domain experts, particularly in specialised domains such as the classification of scientific publications (Sun et al., 2024).

The community focused on generating ontologies of research topics has developed several promising (semi-)automated methods. A notable example is Klink-2 (Osborne and Motta, 2015), which enabled the construction of the Computer Science Ontology (Salatino et al., 2018). Klink-2 identifies both hierarchical and synonymous relationships by leveraging co-occurrence patterns, topic similarity, and subsumption logic. (Shen et al., 2018) applied an extended subsumption technique to build Microsoft’s Field of Science, an ontology containing over 200,000 concepts that was later integrated into the Microsoft Academic Graph. More recently, the OpenAlex team expanded the ASJC taxonomy by identifying more than 4,500 research topics using citation clustering, which they subsequently labelled using LLMs (OpenAlex, 2024). A similar methodology was independently adopted by (Jenset et al., 2025), who developed a taxonomy of 29,000 research concepts and mapped it to the ANZSRC Fields of Research. Furthermore, ontology evaluation methods have been applied to enrich existing ontologies with additional research topics based on various requirements (Kotis et al., 2020; Osborne and Motta, 2018).

In conclusion, while LLMs have been employed to support a range of academic tasks, such as paper discovery (Chow et al., 2024), citation prediction (Buscaldi et al., 2024; Hao et al., 2024), scientific question answering (Auer et al., 2023; Meloni et al., 2025), and literature review generation (Bolanos et al., 2024; Scherbakov et al., 2024), their use for generating research topic ontologies has received limited attention. This study aims to fill this gap by systematically evaluating recent LLMs in terms of their ability to infer semantic relationships between research topics.

3. Background

This section formalises the task under investigation (Section 3.1), introduces the PEM-Rel-8K dataset (Section 3.2), and provides an overview of the LLMs used in our experiments (Section 3.3).

3.1. Task definition

Identifying and formalising semantic relationships among research topics is essential for building academic KOSs, which play a key role in organising the scientific literature and improving information retrieval. These representations underpin many AI222AI - Artificial intelligence systems for paper recommendation and research analysis, enable digital libraries and search engines to move toward robust semantic search, and support scientometric studies, for example, in the assessment of scientific impact and the forecasting of trends (Salatino et al., 2025).

In this paper, we address the task of identifying the semantic relationship between a given pair of research topics. More precisely, we formalise it as a single-label, multi-class classification problem, where each input pair of research topics, denoted as $t_{A}$ and $t_{B}$ , is assigned to exactly one of the following mutually exclusive categories:

•

broader: $t_{A}$ is a broader topic that subsumes the more specific topic $t_{B}$ . For example, databases subsumes distributed databases.

•

narrower: $t_{A}$ is a more specific topic subsumed by the broader topic $t_{B}$ . For example, adaptive signal processing is subsumed by signal processing. This is the inverse relationship of broader.

•

same-as: $t_{A}$ and $t_{B}$ are semantically equivalent and can be used interchangeably across a broad range of information retrieval tasks. For example, ontology alignment and ontology matching.

•

other: in contrast with the previous three categories, this category does not define a semantic relation. Its purpose is simply to provide the classifier with a mechanism to label negative examples. Without it, the classifier would be forced to assign one of the three predefined semantic relationships even when none actually applies to a given pair of topics.

The first three relationships are widely employed in the construction of ontologies for research topics (Smith, 2020), as they are crucial for representing hierarchical structures as well as handling synonymous terms (otherwise known as synonym rings). In practice, although we adopt simplified labels for these relations, they correspond directly to the standard Simple Knowledge Organization System (SKOS)333SKOS: Simple Knowledge Organization System — https://www.w3.org/TR/skos-reference properties skos:narrower, skos:broader, and skos:exactMatch. This design choice aligns with our primary objective of constructing research topic ontologies using SKOS as the underlying data model. Conversely, as previously noted, the other category does not denote a formal semantic relationship and it simply provides a mechanism to avoid forced misclassification.

It is important to acknowledge that other semantic relationships, such as part-of, instance-of, has-attribute, and has-process, also play an important role in ontology development. However, this paper focuses on the core structural foundation of a research topics ontology, which is captured by the three relationships defined above.

3.2. The PEM-Rel-8K dataset

The aim of this study is to evaluate model performance in a multidisciplinary context and to investigate the cross-discipline adaptability of models fine-tuned on data from a single discipline when applied to others. To this end, we sampled semantic relationships from three KOSs, each representing a distinct scientific discipline: the IEEE Thesaurus (Engineering), PhySH (Physics), and MeSH (Biomedicine). We first constructed a dedicated dataset for each taxonomy, labelled IEEE-Rel-3K, PhySH-Rel-875, and MeSH-Rel-4K, and then merged them into a unified benchmark named PEM-Rel-8K.

The following sections detail the construction of each dataset and the subsequent integration into the final benchmark.

In the engineering discipline, we constructed IEEE-Rel-3K by sampling 3,200 semantic relationships from the IEEE Thesaurus444IEEE Thesaurus - https://www.ieee.org/publications/services/thesaurus.html. Specifically, we used the latest available RDF version of the Thesaurus (v1.02)555IEEE Thesaurus (RDF) - https://github.com/angelosalatino/ieee-taxonomy-thesaurus-rdf. We randomly selected 800 examples each for the broader and narrower relationships. Capturing instances of same-as was more challenging and required manual supervision, as the IEEE Thesaurus does not explicitly provide this relation. Instead, it relies on skos:prefLabel and skos:altLabel to indicate both related terms and synonyms. For example, ‘4G mobile communication’ is listed as a skos:altLabel of ‘5G mobile communication’. To address this, three experts manually analysed the set of topics connected through skos:prefLabel and skos:altLabel, and extracted 800 pairs of terms that were deemed to be true synonyms, lexical variants, or near-synonyms. These pairs were then annotated with the same-as relation. Finally, to produce the other pairs, we randomly generated 800 topic pairs that do not share any semantic link in the original ontology.

In the physics discipline, we developed the PhySH-Rel-875 dataset by extracting 875 semantic relationships from the Physics Subject Headings666Physics Subject Headings (PhySH) - https://github.com/physh-org/PhySH, which are provided in RDF format and adhere to the SKOS standard. From the April 2024 release, we randomly sampled 250 semantic relationships for each of the broader and narrower properties. Similar to the IEEE taxonomy, PhySH does not explicitly include synonyms but uses the skos:altLabel property for alternative labels. As in the IEEE case, many topic pairs did not meet the criteria for the same-as relation. For instance, algebraic structure is listed as an skos:altLabel of abstract algebra. Three experts reviewed the 608 available skos:altLabel entries and were able to validate 125 instances as same-as relations. Finally, we generated 250 other pairs using the same procedure adopted for the IEEE dataset. This is the only imbalanced dataset in our collection. The imbalance is due to the difficulty of identifying a sufficient number of reliable same-as relations, combined with the intention to maintain a good sample size for the other relation types.

In the biomedical domain, we developed MeSH-Rel-4K, a dataset comprising 4,000 semantic relationships extracted from Medical Subject Headings777Medical Subject Headings (MeSH) - https://id.nlm.nih.gov/mesh/. The latest version of MeSH includes more than 31K subject headings described using a custom schema888MeSH Schema - http://id.nlm.nih.gov/mesh/vocab (mesh:). In this schema, each topical subject heading is modelled as a mesh:TopicalDescriptor. These topics are connected by approximately 42K hierarchical relationships, expressed by mesh:broaderDescriptor. In addition, MeSH includes about 8K associative, non-hierarchical links between concept records, expressed by mesh:relatedConcept, which connect semantically related concepts.

We extracted the relationships for the MeSH-Rel-4K dataset from the January 2025 release of MeSH. Specifically, we randomly sampled 1,000 broader and 1,000 narrower relationships, both derived from the mesh:broaderDescriptor property. We then sampled 1,000 same-as instances using the mesh:relatedConcept property. Finally, we manually curated 1,000 pairs of semantically unrelated topics to represent the other category.

Finally, we combined the three previously described datasets to construct PEM-Rel-8K. This multi-disciplinary dataset comprises over 8,000 relationships spanning the three scientific disciplines. Each single-discipline dataset was divided into training, validation, and test sets following a 7:1:2 ratio. Table 1 presents the sizes of both the individual and the combined datasets.

All datasets, along with the code used for their construction, are available at: https://github.com/ImTanay/LLM-Multi-Domain-Ontology.

3.3. Large Language Models

In this study, we evaluated twelve decoder-only LLMs. Table 2 presents an overview of the LLMs, listing the model names, the shorter aliases used throughout this paper, the number of parameters, the context window sizes, and the Low-Rank Adaptation (LoRA) (Hu et al., 2021) parameters employed for their fine-tuning:

i) r, the rank of the update matrices (lower values produce smaller update matrices with fewer trainable parameters), and ii) alpha, the LoRA scaling factor.

The selected models vary in size and represent several prominent families: Mistral (Jiang et al., 2023) (three models), Llama (AI@Meta, 2024) (three models), Gemma (Team et al., 2024) (three models), Phi (Abdin et al., 2024a, b) (two models), and Zephyr (Han, 2024) (one model). The number of parameters ranges from 2.51 billion (gemma-2b, the smallest) to 27.2 billion (gemma-27b, the largest). All models were quantised to 4-bit precision and are publicly accessible via HuggingFace.

4. Experimental Methodology

We conducted an extensive series of experiments to evaluate the performance of LLMs on the PEM-Rel-8K benchmark. A distinguishing characteristic of this benchmark is its modular structure, which enables training on each of the three discipline-specific datasets individually (IEEE-Rel-3K, MeSH-Rel-4K, and PhySH-Rel-875) as well as on their combined form. Testing can likewise be performed on each of the individual datasets or on the aggregate.

We explored two main evaluation settings: zero-shot learning and fine-tuning.

In the zero-shot learning setting, we examined model performance using two different prompting strategies. Evaluations were carried out on the multidisciplinary test set as well as on each of the three discipline-specific test sets.

In the fine-tuning setting, we systematically assessed three conceptually distinct scenarios:

(1)

Discipline-specific evaluation: The LLM is fine-tuned and tested within the same discipline (e.g., fine-tuned on the training set of MeSH-Rel-4K and evaluated on the test set of MeSH-Rel-4K), a setting expected to yield the highest performance due to discipline alignment. 2. (2)

Cross-discipline evaluation: The LLM is fine-tuned on one discipline and tested on another (e.g., fine-tuned on the training set of MeSH-Rel-4K and evaluated on the test set of PhySH-Rel-875) as well as on the entire PEM-Rel-8K to assess cross-discipline transferability. 3. (3)

Multidisciplinary evaluation: The LLM is fine-tuned using the training set of PEM-Rel-8K and evaluated on its corresponding test set as well as on the test sets of three discipline-specific datasets to assess its robustness across different disciplines.

The resulting experimental framework includes 24 distinct combinations of train and test sets. Each was applied across 12 different LLMs, resulting in a total of 288 experimental runs. Table 1 summarises the sizes of the datasets used in these experiments and their partitioning.

We evaluated the performance of all models using macro-averaged precision, recall, and F1-score, as these are standard metrics for classification tasks.

In the next two subsections, we provide a detailed description of the procedures used in both the zero-shot and fine-tuning experiments. Additional technical specifications, including the libraries and hardware used, are provided in Section 4.3.

4.1. Zero-Shot Prompting Strategies

We implemented two prompting strategies: standard prompting and bidirectional CoT prompting (Aggarwal et al., 2026).

The standard prompting, employed as baseline, generates prompts for each pair of research topics via a predefined template (available in the GitHub999Our prompts - https://github.com/ImTanay/LLM-Multi-Domain-Ontology repository). This template outlines the task, defines the four relationships, and mandates a specific format for the output to facilitate parsing. To ensure a fair comparison, we employed the same prompt for all models.

Bidirectional CoT prompting, introduced by (Aggarwal et al., 2026), was included in this study due to its strong performance in classifying semantic relationships between research topics. This technique builds upon CoT prompting, which has shown effectiveness across a broad range of complex tasks (Wei et al., 2023; Kojima et al., 2023). The approach involves two sequential prompts. The first prompt asks the model to provide definitions for both topics, construct a sentence incorporating them, and reflect on their potential semantic relationship. The second prompt uses the response generated by the first, adds instructions for the classification task, and presents it back to the model. This process is repeated with the positions of the two topics swapped to achieve bidirectionality. Finally, a rule-based referee is applied to resolve any discrepancies between the outcomes of the two runs. All prompt templates and referee rules follow those described in the original paper (Aggarwal et al., 2026).

In total, the combination of two strategies, 12 LLMs (Table 2), and four datasets (Table 1) resulted in 96 experiments.

4.2. Fine-Tuning

The fine-tuned models were obtained by training the 12 LLMs on the four datasets: PEM-Rel-8K, IEEE-Rel-3K, MeSH-Rel-4K, and PhySH-Rel-875. This process yielded a total of 48 fine-tuned models. To ensure compatibility and comparability of results, we consistently used the same training and validation sets (split 7:1:2, as detailed in Section 3.2) and employed a uniform prompt-based fine-tuning approach. This involved formatting each pair of research topics into a conversational prompt using the following predefined template:

Classify the relationship between ‘[TOPIC-A]’ and ‘[TOPIC-B]’

In this format, the placeholders ‘[TOPIC-A]’ and ‘[TOPIC-B]’ were dynamically replaced with the corresponding surface forms of topics $t_{A}$ and $t_{B}$ , as follows:

user: Classify the relationship between ‘Biology’ and ‘Genetics’ model: relationship: broader

During training and validation, we explicitly included the expected output in the format (relationship: [RELATIONSHIP-TYPE]) to guide the model towards producing structured and easily parsable responses. This design ensured consistency and simplified the extraction of predicted labels. The resulting 48 models were then tested on the four datasets to examine all the evaluation scenarios discussed previously, constituting 192 distinct experiments.

4.3. Experimental Setup

To facilitate the fine-tuning and interaction with the set of LLMs reported in Table 2, we employed two open-source libraries: KoboldAI101010KoboldAI - https://github.com/KoboldAI/KoboldAI-Client and Unsloth (Daniel Han and team, 2023).

KoboldAI, is an open-source platform built on top of llama.cpp111111llama.cpp - https://github.com/ggml-org/llama.cpp that provides API-based access to LLMs hosted locally. We used KoboldAI to interact with the 12 models in a zero-shot setting.

Unsloth was employed for fine-tuning and transfer learning experiments. It is an open-source Python library built on top of the Hugging Face Transformers library (Wolf et al., 2020) and PEFT (Parameter-Efficient Fine-Tuning) method (Mangrulkar et al., 2022). Unsloth supports 4-bit quantised training via BitsAndBytes121212BitsAndBytes - https://github.com/bitsandbytes-foundation/bitsandbytes, which significantly reduces memory usage and enables training on consumer-grade GPUs. It also integrates LoRA (Hu et al., 2021), allowing parameter-efficient fine-tuning by injecting lightweight trainable adapter layers into frozen pre-trained models.

All experiments were conducted on Google Colaboratory instances equipped with NVIDIA A100 and L4 GPUs. To ensure transparency and reproducibility, the complete codebase is available in the GitHub repository131313Our code – https://github.com/ImTanay/LLM-Multi-Domain-Ontology. It includes all scripts for prompting and fine-tuning, along with detailed configuration parameters for both Unsloth and KoboldAI.

5. Results

In this section, we present and discuss the experimental results. We begin with zero-shot learning, followed by the three evaluation scenarios involving fine-tuning. Finally, we compare different models by considering the most comprehensive case, in which they were both fine-tuned and evaluated on the full PEM-Rel-8K dataset.

Table 3 reports the best-performing models and their results across the 24 experimental combinations. Among the 12 models evaluated, only five achieved top performance: gemma-27b, which ranked first in eight cases; mistral-22b and phi-4, each leading in five cases; and mistral-7b and gemma-9b, each with three top results.

5.1. Zero-shot Strategies

Bidirectional CoT consistently outperformed standard prompting across all evaluated models and datasets, in agreement with previous findings (Wei et al., 2023; Aggarwal et al., 2026). Specifically, the CoT approach resulted in an average F1-score improvement of 5.1% (standard deviation $\pm 0.3$ %) compared to simple prompting across the four datasets. When applied to the full datasets, CoT achieved a robust F1 score of 73.0%, exceeding the performance of simple prompting by 5.3 percentage points.

In the experiment using simple prompting, mistral-22b and gemma-9b achieved the highest performance. However, when applying bidirectional CoT, the smaller mistral-7b outperformed the other models on three out of four datasets (IEEE-Rel-3K, PhySH-Rel-875, and PEM-Rel-8K). These results underscore the importance of structured prompting techniques, particularly when deploying smaller and more cost-efficient models.

5.2. Fine-Tuning

In the following, we discuss the three evaluation scenarios presented in Section 4: discipline-specific evaluation, cross-discipline evaluation, and multidisciplinary evaluation.

5.2.1. Discipline-specific evaluation.

The highest performance on each of the three discipline-specific test sets was achieved by models fine-tuned on the corresponding training data. However, performance varied substantially across the test sets, and the top-performing model was different for each one. Specifically, gemma-27b achieved the highest overall F1-score on IEEE-Rel-3K (98.9% F1), while gemma-9b delivered the best performance on PhySH-Rel-875 (93.6%), and phi-4 performed best on MeSH-Rel-4K (91.7%).

The excellent results on IEEE-Rel-3K may be attributed to the nature of the relevant topics in electrical engineering and computer science, which are likely to be well represented in the LLM’s training data, such as code repositories, technical documents, manuals, and patents. Notably, fine-tuning on the much smaller PhySH-Rel-875 still produced a high F1-score, demonstrating the effectiveness of fine-tuning even when using a relatively small training set.

5.2.2. Cross-discipline evaluation.

The cross-discipline experiments demonstrate the generalisation capabilities of LLMs when fine-tuned on one discipline and evaluated on another. This behaviour is analogous to transfer learning, where knowledge acquired from solving one task or operating in one domain is applied to enhance performance on different but related tasks (Zhuang et al., 2020).

Fine-tuning models on any single discipline resulted in highly competitive performance. In particular, for each discipline, the best cross-discipline model achieved F1 scores that were, on average, only 5.1% (standard deviation $\pm 1.7$ %) F1 points lower than those of the model trained directly on the same discipline. Moreover, these models provided a substantial average improvement of 16.1 percentage points (standard deviation $\pm 1.6$ %) over the best zero-shot strategy.

The optimal cross-discipline training sets and models varied depending on the test set. When evaluating on IEEE-Rel-3K, the best-performing solution was phi-4 fine-tuned on PhySH-Rel-875 (94.7% F1). For MeSH-Rel-4K, the top model was gemma-27b fine-tuned on PhySH-Rel-875 (87.7% F1). Finally, for the PhySH-Rel-875 test set, the best-performing model was phi-4 fine-tuned on MeSH-Rel-4K (86.4% F1).

The results confirm that models fine-tuned on certain STEM disciplines can be successfully applied to others, suggesting that it is not necessary to use discipline-specific datasets to achieve solid performance in this task. However, these findings should be further validated across disciplines that are typically classified into distinct domains, such as physics and the social sciences.

5.2.3. Multidisciplinary evaluation.

The purpose of this evaluation is to determine whether models trained on the multidisciplinary PEM-Rel-8K dataset yield consistently strong results across different disciplines. The results support this hypothesis, as models trained on the full dataset performed excellently on the three discipline-specific test sets. Their F1-scores were only 1.2% lower than those of models trained exclusively on discipline-specific datasets. In all cases, the best-performing model was gemma-27b, which achieved F1-scores of 97.3% on IEEE-Rel-3K, 90.8% on MeSH-Rel-4K, and 92.5% on PhySH-Rel-875.

These findings confirm that PEM-Rel-8K can be used to develop robust models that generalise well across multiple disciplines and are suitable for producing multidisciplinary ontologies.

When considering the full PEM-Rel-8K test set, the best solution was again gemma-27b fine-tuned on the same dataset, achieving an F1-score of 93.5%. Notably, phi-4, trained on MeSH-Rel-4K, also produced competitive results with an F1-score of 91.9%. This further reinforces the conclusions of the cross-discipline evaluation.

5.3. Comparing LLMs on PEM-Rel-8K

As observed in previous analyses, fine-tuning on PEM-Rel-8K yielded strong results across all disciplines. This section presents a more detailed investigation of the LLMs that were fine-tuned and evaluated using the full PEM-Rel-8K dataset.

Table 4 reports the precision, recall, and F1-score for all evaluated models. It also provides a breakdown of these metrics across the four categories: the three relations and the other category that captures cases in which the two topics are not connected by any of the defined relations.

In line with the broader findings presented in Table 3, gemma-27b emerges as the best-performing model overall, achieving a F1-score of 93.5%. Several other models also demonstrate strong performance on this benchmark, notably gemma-9b (92.6% F1), mistral-22b (92.2% F1), and phi-4 (91.8% F1).

Overall, gemma-27b achieves the highest F1-score across all relation types and also obtains the highest recall for narrower (92.4%). However, some models demonstrate notable strengths in specific relations. For example, mistral-22b achieves the highest precision for broader (96.3%) and the best recall for same-as (95.3%). gemma-2b attains the highest precision for narrower (94.3%), while gemma-9b excels in precision for same-as (90.9%) and recall for other (97.3%). Furthermore, phi-4 obtains the best precision for other (97.0%) and achieves the highest recall for broader (92.7%).

The four categories achieve comparable average F1-scores across the models, with values ranging from 88% to 94%. However, other consistently achieves the highest average scores across all metrics (F1-score: 94.3%, precision: 94.6%, and recall: 94.2%). Although other is not a formal semantic relation, as discussed in Section 3.1, these strong results indicate that LLM-based systems are very effective at identifying cases in which pairs of topics are not connected by any of the predefined relations. This capability is particularly valuable, as incorrectly identified relations, especially hierarchical ones, can compromise the quality of the resulting ontology by introducing cycles or enabling incorrect inferences (Osborne and Motta, 2015).

Notably, same-as proves the most challenging relation, recording the lowest F1-score (average 88.5%) and precision (average 87.2%). This relatively lower performance can be attributed to the inherent difficulty of defining same-as with precision, as even different ontologies and thesauri across various disciplines often interpret and apply this relation inconsistently.

5.4. Analysis of Best-Performing LLMs on PEM-Rel-8K

The confusion matrices in Figure 1 provide a comprehensive evaluation of the performance of the top 5 LLMs on the PEM-Rel-8K. These models (gemma-27b, gemma-9b, mistral-22b, phi-4, and zephyr-7b) achieved high overall F1-scores, ranging from 93.5% to 91.5%, with gemma-27b performing the best.

All five models exhibit strong diagonal dominance in their respective confusion matrices, reflecting high classification accuracy across all four relation types: broader, narrower, other, and same-as. Among these, the other category consistently demonstrates the highest classification accuracy, with correct predictions exceeding 380 (out of 410) instances across all models. This suggests that distinguishing non-hierarchical, non-equivalent relations is comparatively easier.

Across all models, the most common error involves misclassifying hierarchical relations (broader/narrower) as equivalence (same-as). The opposite error, namely misclassifying same-as relations as hierarchical, occurs less frequently but remains noticeable. In details, gemma-27b (F1: 93.5%, see Fig. 1(a)) exhibits best performance, particularly for the narrower and other, correctly classifying 379 and 396 instances, respectively. Misclassifications are minimal, although some confusion persists between same-as and the hierarchical categories. Specifically, 18 instances of broader were misclassified as same-as, while 5 instances of same-as were incorrectly labelled as broader. Similarly, 24 instances of narrower were predicted as same-as, and 17 same-as instances were misclassified as narrower. gemma-9b (F1: 92.6%, see Fig. 1(b)), the smaller variant of gemma-27b, demonstrates slightly reduced performance, particularly in distinguishing same-as from hierarchical relations. Notably, 18 same-as instances were misclassified as broader and 17 as narrower, while 10 broader and 23 narrower instances were predicted as same-as. mistral-22b (F1: 92.2%, see Fig. 1(c)) performs comparably to the two gemma models but exhibits increased confusion between same-as and other relation types. In particular, 36 instances of narrower were misclassified as same-as, alongside 27 broader instances misclassified as same-as, and 16 as other. phi-4 (F1: 91.8%, see Fig. 1(d)) also displays considerable confusion between same-as and hierarchical relations. Specifically, 26 narrower and 15 broader instances were misclassified as same-as, while 34 same-as instances were incorrectly labelled as broader (17) and narrower (17), reflecting a bidirectional confusion pattern. zephyr-7b (F1: 91.5%, see Fig. 1(e)), the smallest among the top five models, exhibits the highest rate of misclassification. It frequently confuses narrower and same-as (25 and 18 misclassifications, respectively), as well as broader and same-as (18 and 15 misclassifications, respectively), indicating a greater difficulty in disentangling equivalence from hierarchical semantics.

An analysis of the data revealed that the most frequent error pattern, namely the misclassification of the hierarchical relation as same-as, is largely due to lexical overlap between terms and inconsistent definitions of semantic relationships across different ontologies. Consider, for instance, the MeSH classification: “microsporea”, a class of fungi, is a more specific term than “microsporidians”, a group of spore-forming unicellular parasites. Despite this, every top-performing model (gemma-27b, gemma-9b, mistral-22b, phi-4, and zephyr-7b) misidentified them as being same-as. Another similar example comes from the IEEE Thesaurus, where “Nanoscale technology” is a more specific concept than “Nanotechnology”. Yet, the five models all considered these two topics to be the same.

6. Conclusions

In this paper, we have presented a comprehensive analysis of the performance of a diverse set of LLMs in identifying semantic relations between pairs of research topics. This task is essential for the construction of ontologies structuring research fields, which are critical for managing and organising scientific knowledge (Salatino et al., 2025).

To support this analysis, we have introduced PEM-Rel-8K, a multidisciplinary and modular benchmark created by aggregating three datasets extracted from IEEE, PhySH, and MeSH. Our experiments show that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines. Among the evaluated models, the fine-tuned gemma-27b achieved the highest F1-score at 93.5%. Remarkably, the best-performing LLM fine-tuned on PEM-Rel-8K achieved an average F1-score across the three disciplines that was only 1.2% lower than that of models fine-tuned exclusively on a single discipline. Furthermore, our analysis of cross-discipline adaptability indicates that LLMs trained on one discipline can generalise effectively to others. These results demonstrate that PEM-Rel-8K enables the development of robust models capable of generalising across multiple research areas, making them well-suited for constructing multidisciplinary research topic ontologies.

Future work will advance along four primary directions. First, we plan to extend our research to additional disciplines. We will begin with STEM fields and then progressively expand to the Social Sciences, Humanities, and Linguistics. Second, we plan to incorporate cross-domain taxonomies, such as the Dewey Decimal Classification and the Library of Congress Subject Headings, to support cross-disciplinary alignment and to establish a standardised global framework for unifying specialised domain ontologies. Third, we aim to develop an LLM-based system for ontology matching and evolution that integrates academic ontologies by identifying and extracting core hierarchical and synonymous relations. In this context, we also plan to develop an additional module to identify more nuanced relations, such as part-of, instance-of, and has-attribute. Finally, we plan to apply this system to integrate and extend a selection of prominent taxonomies, aiming to construct a comprehensive, multi-discipline ontology of research topics. We believe that such a solution has the potential to address current issues of coverage and fragmentation, thereby providing valuable support for repositories, digital libraries, academic search engines, and AI-powered tools.

Bibliography66

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, et al. (2024 a) Phi-3 technical report: a highly capable language model locally on your phone, 2024 . URL https://arxiv. org/abs/2404.14219 . Cited by: §3.3 .
2M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, J. R. Lee, Y. T. Lee, Y. Li, W. Liu, C. C. T. Mendes, A. Nguyen, E. Price, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, X. Wang, R. Ward, Y. Wu, D. Yu, C. Zhang, and Y. Zhang (2024 b) Phi-4 technical report . External Links: 2412.08905 , Link Cited by: §3.3 .
3T. Aggarwal, A. Salatino, F. Osborne, and E. Motta (2026) Large language models for scholarly ontology generation: an extensive analysis in the engineering field . Information Processing & Management 63 ( 1 ), pp. 104262 . Cited by: §1 , §4.1 , §4.1 , §5.1 .
4AI@Meta (2024) Llama 3 model card . . External Links: Link Cited by: §3.3 .
5S. Angioni, A. Salatino, F. Osborne, D. R. Recupero, and E. Motta (2021) AIDA: a knowledge graph about research dynamics in academia and industry . Quantitative Science Studies 2 ( 4 ), pp. 1356–1398 . Cited by: §1 .
6ANSI/NISO Z 39.19-2005(R 2010) (2010) Guidelines for the construction, format, and management of monolingual controlled vocabularies . Standard National Information Standards Organization , Baltimore, Maryland . External Links: Document Cited by: §2.1 . · doi ↗
7S. Auer, D. A. Barone, C. Bartz, E. G. Cortes, M. Y. Jaradeh, O. Karras, M. Koubarakis, D. Mouromtsev, D. Pliukhin, D. Radyush, et al. (2023) The sciqa scientific question answering benchmark for scholarly knowledge . Scientific Reports 13 ( 1 ), pp. 7240 . Cited by: §2.2 .
8H. Babaei Giglou, J. D’Souza, and S. Auer (2023) LL Ms 4OL: large language models for ontology learning . In International Semantic Web Conference , pp. 408–427 . Cited by: §1 , §2.2 .