Loading paper

Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge | Tomesphere

arXiv:1901.10816·cs.DL·August 2, 2019

Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge

Mohamad Yaser Jaradeh, Allard Oelen, Kheir Eddine Farfar, Manuel, Prinz, Jennifer D'Souza, G\'abor Kismih\'ok, Markus Stocker, S\"oren Auer

TL;DR

This paper introduces a new infrastructure for scholarly knowledge in the form of a machine-readable knowledge graph, enabling advanced processing and curation of academic information.

Contribution

It presents the first multi-modal approach combining crowdsourced and automated methods for acquiring scholarly knowledge into a knowledge graph.

Findings

01

Users found the infrastructure novel and engaging.

02

The approach shows promise for enabling innovative scholarly knowledge processing.

03

Initial user evaluation indicates positive reception and potential benefits.

Abstract

Despite improved digital access to scholarly knowledge in recent decades, scholarly communication remains exclusively document-based. In this form, scholarly knowledge is hard to process automatically. In this paper, we present the first steps towards a knowledge graph based infrastructure that acquires scholarly knowledge in machine actionable form thus enabling new possibilities for scholarly knowledge curation, publication and processing. The primary contribution is to present, evaluate and discuss multi-modal scholarly knowledge acquisition, combining crowdsourced and automated techniques. We present the results of the first user evaluation of the infrastructure with the participants of a recent international conference. Results suggest that users were intrigued by the novelty of the proposed infrastructure and by the possibilities for innovative scholarly knowledge processing it…

Equations6

γ = [cos (p_{i}, p_{j})]

γ = [cos (p_{i}, p_{j})]

Φ_{i, j} = {10 if p_{j} \in c_{i} otherwise

Φ_{i, j} = {10 if p_{j} \in c_{i} otherwise

φ_{i, j} = (Φ_{i, j})_{c_{i} \in C p_{j} \in s im (p)}

φ_{i, j} = (Φ_{i, j})_{c_{i} \in C p_{j} \in s im (p)}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge

Mohamad Yaser Jaradeh

0000-0001-8777-2780

L3S Research Center, Leibniz University of Hannover

[email protected]

,

Allard Oelen

0000-0001-9924-9153

L3S Research Center, Leibniz University of Hannover

[email protected]

,

Kheir Eddine Farfar

TIB Leibniz Information Centre for Science and Technology

[email protected]

,

Manuel Prinz

0000-0003-2151-4556

TIB Leibniz Information Centre for Science and Technology

[email protected]

,

Jennifer D’Souza

0000-0002-6616-9509

TIB Leibniz Information Centre for Science and Technology

[email protected]

,

Gábor Kismihók

0000-0003-3758-5455

TIB Leibniz Information Centre for Science and Technology

[email protected]

,

Markus Stocker

0000-0001-5492-3212

TIB Leibniz Information Centre for Science and Technology

[email protected]

and

Sören Auer

0000-0002-0698-2864

TIB Leibniz Information Centre for Science and Technology, L3S Research Center, Leibniz University of Hannover

[email protected]

Abstract.

Despite improved digital access to scholarly knowledge in recent decades, scholarly communication remains exclusively document-based. In this form, scholarly knowledge is hard to process automatically. In this paper, we present the first steps towards a knowledge graph based infrastructure that acquires scholarly knowledge in machine actionable form thus enabling new possibilities for scholarly knowledge curation, publication and processing. The primary contribution is to present, evaluate and discuss multi-modal scholarly knowledge acquisition, combining crowdsourced and automated techniques. We present the results of the first user evaluation of the infrastructure with the participants of a recent international conference. Results suggest that users were intrigued by the novelty of the proposed infrastructure and by the possibilities for innovative scholarly knowledge processing it could enable.

Information Science, Knowledge Graph, Knowledge Capture, Research Infrastructure, Scholarly Communication

††copyright: none††conference: in Review for KCap ’19: International Conference on Knowledge Capture; Nov 19–22, 2019; Marina del Rey, CA

1. Introduction

Documents are central to scholarly communication. In fact, nowadays almost all research findings are communicated by means of digital scholarly articles. However, it is difficult to automatically process scholarly knowledge communicated in this form. The content of articles can be indexed and exploited for search and mining, but knowledge represented in form of text, images, diagrams, tables, or mathematical formulas cannot be easily processed and analysed automatically. The primary machine-supported tasks are largely limited to classic information retrieval. Current scholarly knowledge curation, publishing and processing does not exploit modern information systems and technologies to their full potential (Hars, 2001). As a consequence, the global scholarly knowledge base continues to be little more than a distributed digital document repository.

The key issue is that digital scholarly articles are mere analogues of their print relatives. In the words of Van de Sompel and Lagoze (de Sompel and Lagoze, 2009) dating back to 2009 and earlier: “The current scholarly communication system is nothing but a scanned copy of the paper-based system.” A further decade has gone by and this observation continues to be true.

Print and digital media suffer from similar challenges. However, given the dramatic increase in output volume and travel velocity of modern research (Bornmann and Mutz, 2015) as well as the advancements in information technologies achieved over the past decades, the challenges are more obvious and urgent today than at any time in the past.

Scholarly knowledge remains as ambiguous and difficult to reproduce (Bosman et al., 2017) in digital as it used to be in print. Moreover, addressing modern societal problems relies on interdisciplinary research. Answers to such problems are debated in scholarly discourse spanning often dozens and sometimes hundreds of articles (Alrøe and Noe, 2014). While citation does link articles, their contents are hardly interlinked and generally not machine actionable. Therefore, processing scholarly knowledge remains a manual, and tedious task.

Furthermore, document-based scholarly communication stands in stark contrast to the digital transformation seen in recent years in other information rich publishing and communication services. Examples include encyclopedia, mail order catalogs, street maps or phone books. For these services, traditional document-based publication was not just digitized but has seen the development of completely new means of information organization and access. The striking difference with scholarly communication is that these digital services are not mere digital versions of their print analogues but entirely novel approaches to information organization, access, sharing, collaborative generation and processing.

There is an urgent need for a more flexible, fine-grained, context sensitive and machine actionable representation of scholarly knowledge and corresponding infrastructure for knowledge curation, publishing and processing (Group, 2019; Tennant et al., 2019). We suggest that representing scholarly knowledge as structured, interlinked, and semantically rich knowledge graphs is a key element of a technical infrastructure (Auer et al., 2018). While the technology for representing, managing and processing scholarly knowledge in such form is largely in place, we argue that one of the most pressing concerns is how scholarly knowledge can be acquired as it is generated along the research lifecycle, primarily during the Conducting step (Vaughan et al., 2013).

In this article, we introduce the Open Research Knowledge Graph (ORKG) as an infrastructure for the acquisition, curation, publication and processing of semantic scholarly knowledge. We present, evaluate and discuss ORKG based scholarly knowledge acquisition using crowdsourcing and text mining techniques as well as knowledge curation, publication and processing. The alpha release of the ORKG is available online111https://labs.tib.eu/orkg/. Users can provide feedback on issues and features, guide future development with requirements, contribute to implementation and last but not least populate the ORKG with content.

2. Problem Statement

We illustrate the problem with an example from life sciences. When searching for publications on the popular Genome editing method CRISPR/Cas in scholarly search engines we obtain a vast amount of search results. Google Scholar, for example, currently returns more than $50\,000$ results, when searching for the search string ‘CRISPR Cas’. Furthermore, research questions often require complex queries relating numerous concepts. Examples for CRISPR/Cas include: How good is CRISPR/Cas w.r.t. precision, safety, cost? How is genome editing applied to insects? Who has applied CRISPR/Cas to butterflies? Even if we include the word ‘butterfly’ to the search query we still obtain more than 600 results, many of which are not relevant. Furthermore, relevant results might not be included (e.g., due to the fact that the technical term for butterfly is Lepidoptera, which combined with ‘CRISPR Cas’ returns over $1\,000$ results). Finally, virtually nothing about the returned scholarly knowledge in the returned documents is machine actionable: human experts are required to further process the results.

We argue that keyword-based information retrieval and document-based results no longer adequately meet the requirements of research communities in modern scholarly knowledge infrastructures (Edwards et al., 2013) and processing of scholarly knowledge in the digital age. Furthermore, we suggest that automated techniques to identify concepts, relations and instances in text (Singh et al., 2018), despite decades of research, do not and unlikely will reach a sufficiently high granularity and accuracy for useful applications. Automated techniques applied on published legacy documents need to be complemented with techniques that acquire scholarly knowledge in machine actionable form as knowledge is generated along the research lifecycle. As Mons suggested (Mons, 2005), we may fundamentally question “Why bury [information in plain text] first and then mine it again.”

This article tackles the following research questions:

•

Are authors willing to contribute structured descriptions of the key research contribution(s) published in their articles using a fit-for-purpose infrastructure, and what is the user acceptance of the infrastructure?

•

Can the infrastructure effectively integrate crowdsourcing and automated techniques for multi-modal scholarly knowledge acquisition?

3. Related Work

Representing encyclopedic and factual knowledge in machine actionable form is increasingly feasible. This is underscored by knowledge graphs such as Wikidata (Vrandečić and Krötzsch, 2014), domain-specific knowledge graphs (Cheatham et al., 2018) as well as industrial initiatives at Google, IBM, Bing, BBC, Thomson Reuters, Springer Nature, among others.

In the context of scholarly communication and its operational infrastructure the focus has so far been on representing, managing and linking metadata about articles, people, data and other relevant entities. The Research Graph (Aryani and Wang, 2017) is a prominent example of an effort that aims to link publications, datasets and researchers. The Scholix project (Burton et al., 2017), driven by a corresponding Research Data Alliance working group and associated organizations, standardized the information about the links between scholarly literature and data exchanged among publishers, data repositories, and infrastructures such as DataCite, Crossref, OpenAIRE and PANGAEA. The Research Objects (Bechhofer et al., 2010) project proposed a machine readable abstract structure that relates the products of a research investigation, including articles, data and other research artifacts. The RMap Project (Hanson et al., 2015) aims at preserving “the many-to-many complex relationships among scholarly publications and their underlying data.” Sadeghi et al. (Sadeghi et al., 2017) proposed to integrate bibliographic metadata in a knowledge graph.

Some initiatives such as the Semantic Publishing and Referencing (SPAR) Ontologies (Peroni, 2014) and the Journal Article Tag Suite (Donohoe et al., 2015) extended the representation to document structure and more fine-grained elements. Others proposed comprehensive conceptual models for scholarly knowledge that capture problems, methods, theories, statements, concepts and their relations (Hars, 2001; De Waard et al., 2006; Boyan et al., 2008; Meister, 2017). Allen, R.B. (Allen, 2012, 2017) explored issues related to implementing entire research reports as structured knowledge bases. Fathalla et al. (Fathalla et al., 2017) proposed to semantically represent key findings of survey articles by describing research problems, approaches, implementations and evaluations. Nanopublications (Groth et al., 2010) is a further approach to describe scholarly knowledge in structured form. Natural Language Processing based Semantic Scholar (Ammar et al., 2018) and the Machine Learning focused paperswithcode.com (PWC) are related systems. Key distinguishing aspects among these systems and the ORKG are the granularity of acquired knowledge (specifically, article bibliographic metadata vs. the materials and methods used and results obtained) and the methods used to acquire knowledge (specifically, automated techniques vs. crowdsourcing).

There has been some work on enriching documents with semantic annotations. Examples include Dokie.li (Capadisli et al., 2017), RASH (Peroni et al., 2017) or MicroPublications (Clark et al., 2014) for HTML and SALT (Groza et al., 2007) for LaTeX. Other efforts focused on developing ontologies for representing scholarly knowledge in specific domains, e.g. mathematics (Lange, 2013).

A knowledge graph for research as proposed here must build, integrate and further advance these and other related efforts and, most importantly, translate what has been proposed so far in isolated prototypes into operational and sustained scholarly infrastructure. There has been work on some pieces but the puzzle has obviously not been solved or more of scientific knowledge would be available today in structured form.

4. Open Research Knowledge Graph

We propose to leverage knowledge graphs to represent scholarly knowledge communicated in the literature. We call this knowledge graph the Open Research Knowledge Graph222http://orkg.org (ORKG). Crucially, the proposed knowledge graph does not merely contain (bibliographic) metadata (e.g., about articles, authors, institutions) but semantic (i.e., machine actionable) descriptions of scholarly knowledge.

4.1. Architecture

The infrastructure design follows a classical layered architecture. As depicted in Figure 1, a persistence layer abstracts data storage implemented by labeled property graph (LPG), triple store, and relational database storage technology, each serving specific purposes. Versioning and provenance handles tracking changes to stored data.

The domain model specifies ResearchContribution, the core information object of the ORKG. A ResearchContribution relates the ResearchProblem addressed by the contribution with the ResearchMethod and (at least one) ResearchResult. Currently, we do not further constrain the description of these resources. Users can adopt arbitrary third-party vocabularies to describe problems, methods, and results. For instance, users could use the Ontology for Biomedical Investigations as a vocabulary to describe statistical hypothesis tests.

Research contributions are represented by means of a graph data model. Similarly to the Research Description Framework (Brickley and Guha, 2014) (RDF), the data model is thus centered around the concept of a statement, a triple consisting of two nodes (resources) connected by a directed edge. In contrast to RDF, the data model allows to uniquely identify instances of edges and to qualify these instances (i.e., annotate edges and statements). As metadata of statements, provenance information is a concrete and relevant application of such annotation.

RDF import and export enables data synchronization between LPG and triple store, which enables SPARQL and reasoning. Querying handles the requests by services for reading, updating, and creating content in databases. The following layer is for modules that implement infrastructure features such as authentication or comparison and similarity computation. The REST API acts as the connector between features and services for scholarly knowledge contribution, curation and exploration.

ORKG users in author, researcher, reviewer or curator roles interact differently with its services. Exploration services such as State-of-the-Art comparisons are useful in particular for researchers and reviewers. Contribution services are primarily for authors who intend to contribute content. Curation services are designed for domain specialists more broadly to include for instance subject librarians who support quality control, enrichment and other content organization activities.

4.2. Features

The ORKG services are underpinned by numerous features that, individually or in combination, enable services. We present the most important current features next.

State-of-the-Art (SOTA) comparison.

SOTA comparison extracts similar information shared by user selected research contributions and presents comparisons in tabular form. Such comparisons rely on extracting the set of semantically similar predicates among compared contributions.

We use FastText (Bojanowski et al., 2017) word embeddings to generate a similarity matrix $\gamma$

[TABLE]

with the cosine similarity of vector embeddings for predicate pairs $(p_{i},p_{j})\in\mathcal{R}$ , whereby $\mathcal{R}$ is the set of all research contributions.

Furthermore, we create a mask matrix $\Phi$ that selects predicates of contributions $c_{i}\in\mathcal{C}$ , whereby $\mathcal{C}$ is the set of research contributions to be compared. Formally,

[TABLE]

Next, for each selected predicate $p$ we create the matrix $\varphi$ that slices $\Phi$ to include only similar predicates. Formally,

[TABLE]

where $sim(p)$ is the set of predicates with similarity values $\gamma[p]\geq T=0.9$ with predicate $p$ . The threshold $T$ is computed empirically. Finally, $\varphi$ is used to efficiently compute the common set of predicates and their frequency.

Contribution similarity.

Contribution similarity is a feature used to explore related work, find or recommend comparable research contributions. The sub-graphs $G(r_{i})$ for each research contribution $r_{i}\in\mathcal{R}$ are converted into document $D$ by concatenating the labels of subject $s$ , predicate $p$ , and object $o$ , of all statements $(s,p,o)\in G(r_{i})$ . We then use $TF/iDF$ (Ramos et al., 2003) to index and retrieve the most similar contributions with respect to some query $q$ . Queries are constructed in the same manner as documents $D$ .

Automated Information Extraction.

The ORKG uses machine learning for automated extraction of scientific knowledge from literature. Of particular interest are the NLP tasks named entity recognition as well as named entity classification and linking.

As a first step, we trained a neural network based machine learning model for named entity recognition using in-house developed annotations on the Elsevier Labs corpus of Science, Technology, and Medicine333https://github.com/elsevierlabs/OA-STM-Corpus (STM) for the following generic concepts: process, method, material and data. We use the Beltagy et al. (Beltagy et al., 2019) Named Entity Recognition task-specific neural architecture atop pretrained SciBERT embeddings with a CRF-based sequence tag decoder (Ma and Hovy, 2016).

Linking scientific knowledge to existing knowledge graphs including those from the open domain such as DBpedia (Auer et al., 2007) as well as domain specific graphs such as ULMS (Bodenreider, 2004) is another important feature. Most importantly, such linking enables semi-automated enrichment of research contributions.

4.3. Implementation

The infrastructure consists of two main components: the back end with the business logic, system features and a set of APIs used by services and third party apps; and the front end, i.e. the User Interface (UI).

Back end.

The back end is written in Kotlin (JetBrains, 2011), within the Spring Boot 2 framework. The data is stored in a Neo4j Labeled Property Graph (LPG) database accessed via the Spring Data Neo4j’s Object Graph Mapper (OGM). Data can be queried using Cypher, Neo4j’s native query language. The back end is exposed via a JSON RESTful API accessed by applications, including the ORKG front end. A technical documentation of the current API specification is available online444http://tib.eu/c28v.

Data can be exported to RDF via the Neo4j Semantics extension555https://github.com/jbarrasa/neosemantics. Due to the differences between our graph model and RDF, a “semantification” needs to occur. Most importantly, the ORKG back end auto-generates URIs. Mapping (or changing) these URIs to an existing ontology must be done manually. The Semantics extension also allows importing RDF data and ontologies. The Web Ontology Language is, however, not fully supported.

In order to enrich ORKG data, we support linking to data from other sources. Of particular interest are systems that collect, curate and publish bibliographic metadata or data about entities relevant to scholarly communication, such as people and organizations. Rather than collecting such metadata ourselves we thus link and integrate with relevant systems (e.g., DataCite, Crossref) and their data via identifiers such as DOI, ORCID or ISNI.

Front end.

The ORKG front end is a Web-based tool for, among other users, researchers and librarians and supports searching, exploring, creating and modifying research contributions. Figure 2 depicts the wizard that guides users in creating research contributions. Figure 3 depicts comparing Quick Sort to other research contributions. This service combines the similarity and comparison features to deliver the depicted table. The table can be shared via a persistent link and exported to different formats, including LaTeX, CSV and PDF.

The front end was built with the following two key requirements in mind: (1) Usability to enable a broad range of users, in particular researchers across disciplines, to contribute, curate and explore research contributions; and (2) Flexibility to enable maximum degree of freedom in describing research contributions.

It is implemented according to the ES6 standard of JavaScript using the React666https://reactjs.org/ framework. The Bootstrap777https://getbootstrap.com framework is used for responsive interface components.

The front end design takes great care to deliver the best possible overall experience for a broad range of users, in particular researchers not familiar with knowledge graph technology. User evaluations are a key instrument to continually validate the development and understand requirements.

5. Evaluations

The ORKG infrastructure, its services, features, performance and usability are continually evaluated to inform the next iteration and future developments. Among other preliminary evaluations and results, we present here the first front end user evaluation, which informed the second iteration of front end development, presented in this paper.

5.1. Front end Evaluation

Following a qualitative approach, the evaluation of the first iteration of front end development aimed to determine user performance, identify major (positive and negative) aspects, and user acceptance/perception of the system. The evaluation process had two components: (1) instructed interaction sessions and (2) a short evaluation questionnaire. This evaluation resulted in data relevant to our first research question.

We conducted instructed interaction sessions with 12 authors of articles presented at the DILS2018999https://www.springer.com/us/book/9783030060152 conference. The aim of these sessions was to get first-hand observations and feedback. The sessions were conducted with the support of two instructors. At the start of each session, the instructor briefly explained the underlying principles of the infrastructure, including how it works and what is required from authors to complete the task, i.e. create a structured description of the key research contribution in their article presented at the conference. Then, participants engaged with the system without further guidance from the instructor. However, at any time they could ask the instructor for assistance. For each participant, we recorded the time required to complete the task (to determine the mean duration of a session), the instructor’s notes and the participant’s comments.

In addition to the instructed interaction sessions, participants were invited to complete a short evaluation questionnaire. The questionnaire is available online101010https://doi.org/10.5281/zenodo.2549918. Its aim was to collect further insights into user experience. Since the quantity of collected data was insufficient to establish any correlational or causal relationship (Jansen, 2010), the questionnaire was treated as a qualitative instrument. The paper-based questionnaire consisted of 11 questions. These were designed to capture participant thoughts regarding the positive and negative aspects of the system following their instructed interaction session. Participants completed their questionnaire after the instructed interaction session. All 12 participants answered the questionnaire. The interaction notes, participant comments and the time recordings were collected together with questionnaire responses and analysed in light of our research questions.

A dataset summarizing the research contributions collected in the experiment is available online111111https://doi.org/10.5281/zenodo.3340954. The data is grouped into four main categories. Research Problem describes the main question or issue addressed by the research contribution. Participants used a variety of properties to describe the problem, e.g. problem, addresses, subject, proposes and topic. Approach describes the solution taken by the authors. Properties used included approach, uses, prospective work, method, focus and algorithm. Implementation & Evaluation were the most comprehensively described aspects, arguably because it was easier for participants to describe technical details compared to describing the problem or the approach.

In summary, 75% of the participants found the front end developed in the first iteration fairly intuitive and easy to use. Among the participants, 80% needed guidance only at the beginning while 10% did not need guidance. The time required to complete the task was 17 minutes on average, with a minimum of 13 minutes and a maximum of 22 minutes. Five out of twelve participants suggested to make the front end more keyboard-friendly to ease the creation of research contributions. As participant #3 stated: “More description in advance can be helpful.” Two participants commented that the navigation process throughout the system is complicated for first-time users and suggested alternative approaches. As an example, participant #5 suggested to “Use breadcrumbs to navigate.”

Four participants wished for a visualisation (i.e., graph chart) to be available when creating new research contributions. For instance, participant #1 commented that “It could be helpful to show a local view of the graph while editing.” This type of visualisation could facilitate comprehension and ease curation. Another participant suggested to integrate a document (PDF) viewer and highlight relevant passages or phrases. Participant #4 noted that “If I could highlight the passages directly in the paper and add predicates there, it would be more intuitive and save time.”

Further details of the questionnaire, including participant ratings on main issues, are summarized in Table 1. While the cohort of participants was too small for statistically significant conclusions, these results provided a number of important suggestions that informed the second iteration of front end development, which is presented in this paper and will be evaluated in future conferences.

5.2. Other Evaluations

We have performed preliminary evaluations also of other components of the ORKG infrastructure. The experimental setup for these evaluations is an Ubuntu 18.04 machine with Intel Xeon CPUs $12\times 3.60$ GHz and 64 GB memory.

Comparison feature.

With respect to the State-of-the-Art comparison feature, we compared our approach in ORKG with the baseline approach, which uses brute force to find the most similar predicates and thus checks every possible predicate combination. Table 2 shows the time needed to perform the comparison for the baseline approach and for the approach we implemented and presented above. As the results suggest, our approach clearly outperforms the baseline and the performance gain can be attributed to more efficient retrieval. The experiment is limited to 8 contributions because the baseline approach does not scale to larger sets.

Scalability.

For horizontal scalability, the infrastructure containerizes applications. We also tested the vertical scalability in terms of response time. For this, we created a synthetic dataset of papers. Each paper includes one research contribution described by three statements. The generated dataset contains 10 million papers or 100 million nodes. We tested the system with variable numbers of papers and the average response time to fetch a single paper with its related research contribution is 60 ms. This suggests that the infrastructure can handle large amounts of scholarly knowledge.

NED performance.

We evaluated the performance of a number of existing NED tools on scholarly knowledge, specifically Falcon (Sakor et al., 2019), DBpedia Spotlight (Moro et al., 2014), TagME (Ferragina and Scaiella, 2010), EARL (Dubey et al., 2018), TextRazor121212https://www.textrazor.com/docs/rest and MeaningCloud131313https://www.meaningcloud.com/developer. These tools were used to link to entities from Wikidata and DBpedia. We used the annotated entities from the STM corpus as the experimental data. However, since there is no gold standard for the dataset, we only computed the coverage metric $\zeta=\textit{\# of linked entities}$ divided by # of all entities. Figure 4 summarizes the coverage percentage for the evaluated tools. The results suggest that Falcon is most promising.

6. Discussion and Future Work

We model scholarly knowledge communicated in the scholarly literature following the abstract concept of ResearchContribution with a simplistic concept description. This is especially true if compared to some conceptual models of scholarly knowledge that can be found in the literature, e.g. the one by Hars (Hars, 2001). While comprehensive models are surely appealing to information systems, e.g. for the advanced applications they can enable, we think that populating a database with a complex conceptual model is a significant challenge, one that remains unaddressed. Conscious of this challenge, for the time being we opted for a simplistic model with lower barriers for content creation.

A further and more fundamental concern is the granularity with which scholarly knowledge can realistically be acquired, beyond which the problem becomes intractable. How graph data models and management systems can be employed to represent and manage granular and large amounts of interconnected scholarly knowledge as well as knowledge evolution is another open issue.

With the evaluation of the first iteration of front end development we were able to scrutinize various aspects of the infrastructure and obtain feedback on user interaction, experience and system acceptance. Our results show that the infrastructure meets key requirements: it is easy to use and users can flexibly create and curate research contributions. The results of the questionnaire (Table 1) show that with the exception of ‘Guidance Needed’, all aspects were evaluated above average (i.e., positively). The results suggest that guidance from an external instructor is not needed, reinforcing the usability requirement of the front end. All case study participants displayed an interest in the ORKG and provided valuable input on what should be changed, added or removed. Furthermore, participants suggested to integrate the infrastructure with (digital) libraries, universities, and other institutions.

Based on these preliminary findings, we suggest that authors are willing to provide structured descriptions of the key contribution published in their articles. The practicability of crowdsourcing such descriptions at scale and in heterogeneous research communities needs further research and development.

In support of our second research question, we argue that the presented infrastructure prototypes the integration of both crowdsourced and automated modes of semantic scholarly knowledge acquisition and curation. With the NLP task models and experiments as well as designs for their integration in the front end, we suggest that the building blocks for integrating automated techniques in user interfaces for manual knowledge acquisition and curation are in place (Figure 5). In addition to these two modes, (Stocker et al., 2018) have proposed a third mode whereby scholarly knowledge is acquired as it is generated during the research lifecycle, specifically during data analysis by integrating with computational environments such as Jupyter.

Next steps for the ORKG include importing semi-structured scholarly knowledge from other sources, such as PWC. Such existing content is crucial to build a sizable database to use for future development, testing and demonstration.

User authentication and versioning to track the evolution of curated scholarly knowledge are additional important features on the ORKG development roadmap. Furthermore, we are working on integrating scholarly knowledge (graph) visualisation approaches in the front end to support novel and advanced exploration. Interoperability with related data models such as nanopublications will be addressed. Furthermore, we plan to integrate a discussion feature that will enable debating existing scholarly knowledge, e.g. for knowledge correctness.

Finally, we will further work on automated information extraction, to support, guide and ease information entering. Specifically, we are developing a front end feature that identifies and displays relevant conceptual zones in article content. When a user enters information about, say, the problem addressed by the research contribution then the interface will present problem-related text extracted from the article. This feature will ease entering information, possibly by enabling users to simply highlight the relevant part in the text.

7. Conclusion

This article described the first steps of a larger research and development agenda that aims to enhance document-based scholarly communication with semantic representations of communicated scholarly knowledge. In other words, we aim to bring scholarly communication to the technical standards of the 21st century and the age of modern digital libraries. We presented the architecture of the proposed infrastructure as well as a first implementation. The front end has seen substantial development, driven by earlier user feedback. We have reported here the results of the user evaluation to underpin the development of the current front end, which was recently released for public view and use. By integrating crowdsourcing and automated techniques in natural language processing, initial steps were also taken and evaluated that advance multi-modal scholarly knowledge acquisition using the ORKG.

Acknowledgements.

This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and the TIB Leibniz Information Centre for Science and Technology. The authors would like to thank the participants of the ORKG workshop series, for their contributions to ORKG developments. We also like to thank our colleagues Kemele M. Endris who coined the DILS experiment, Viktor Kovtun for his contributions to the first prototype of the front end, as well as Arthur Brack and Anett Hoppe for their contributions to NLP tasks.

Bibliography49

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Allen (2012) Robert B. Allen. 2012. Supporting Structured Browsing for Full-Text Scientific Research Reports. Co RR abs/1209.0036 (2012).
3Allen (2017) Robert B. Allen. 2017. Rich Semantic Models and Knowledgebases for Highly-Structured Scientific Communication. Co RR abs/1708.08423 (2017).
4Alrøe and Noe (2014) Hugo F. Alrøe and Egon Noe. 2014. Second-Order Science of Interdisciplinary Research: A Polyocular Framework for Wicked Problems. Constructivist Foundations (2014), 65–76.
5Ammar et al . (2018) Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the Literature Graph in Semantic Scholar. Co RR abs/1805.02262 (2018).
6Aryani and Wang (2017) Amir Aryani and Jingbo Wang. 2017. Research Graph: Building a Distributed Graph of Scholarly Works using Research Data Switchboard. In Open Repositories CONFERENCE .
7Auer et al . (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web . Springer, 722–735.
8Auer et al . (2018) Sören Auer, Viktor Kovtun, Manuel Prinz, Anna Kasprzik, Markus Stocker, and Maria Esther Vidal. 2018. Towards a Knowledge Graph for Science. In Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics .