Order in Innovation
Martin Ho, Henry CW Price, Tim S Evans, Eoin O'Sullivan

TL;DR
This paper combines complexity science and innovation economics to analyze vaccine innovation, revealing that innovation progresses through identifiable stages and is influenced by different types of funders, with implications for future funding strategies.
Contribution
It introduces a novel method linking technological evolution to complex networks, enabling the analysis of innovation paths and funder participation in vaccine development.
Findings
Research progresses from basic to commercial stages.
Different funders participate at distinct innovation stages.
Innovation paths reveal bottlenecks and order in vaccine development.
| Vaccine network | Technology platform | Disease targeted | Developer | Year first approved | Source node | Data source |
|---|---|---|---|---|---|---|
| Spikevax | mRNA | COVID-19 | Moderna | 2020 | [14] | [14, 22, 23] |
| Comirnaty | mRNA | COVID-19 | BioNTech | 2020 | [15] | [15, 24, 25] |
| Vaxzeria | Viral Vector | COVID-19 | AstraZeneca | 2020 | [16] | [16, 26] |
| Zabdeno | Viral Vector | Ebola | Janssen | 2020 | [17] | [17] |
| Dengvaxia | Live Attenuated | Dengue | Sanofi Pasteur | 2019 | [18] | [18, 27] |
| Imvanex | Live Attenuated | Smallpox | Bavarian Nordic | 2013 | [19] | [19, 28, 29] |
| Nuvaxovid | Subunit | COVID-19 | Novavax | 2022 | [20] | [20, 30, 31] |
| Shingrix | Subunit | Shingles | GSK | 2017 | [21] | [21, 32] |

| Vaccine network | Publication nodes | Patent nodes | Clinical trial nodes | Funder nodes | Grant nodes | Edges |
|---|---|---|---|---|---|---|
| Spikevax | 62,112 | 24,407 | 10 | 1,286 | 25,043 | 786,563 |
| Comirnaty | 37,383 | 8,127 | 76 | 1,289 | 18,744 | 340,161 |
| Vaxzeria | 58,210 | 32,367 | 5 | 1,274 | 21,528 | 648,877 |
| Zabdeno | 77,359 | 47,145 | 9 | 1,371 | 27,561 | 953,002 |
| Dengvaxia | 9,986 | 2,681 | 30 | 505 | 2,079 | 81,716 |
| Imvanex | 38,979 | 5,298 | 24 | 922 | 13,129 | 357,320 |
| Nuvaxovid | 13,855 | 1,348 | 4 | 924 | 7,547 | 104,182 |
| Shingrix | 12,987 | 6,993 | 22 | 753 | 6,288 | 174,881 |

| Funders on longest path | Funded nodes in entire network | Citations from funded nodes | Citations per funded node | Critical path nodes | Critical path hit rate (%) |
|---|---|---|---|---|---|
| **Top 5 by critical path hit rate** | | | | | |
| Defense Advanced Research Projects Agency | 38 | 315 | 8 | 5 | 13.16 |
| Swedish Research Council | 59 | 246 | 4 | 4 | 6.78 |
| GlaxoSmithKline (UK) | 61 | 330 | 5 | 3 | 4.92 |
| United States Public Health Service | 217 | 1,139 | 5 | 6 | 2.76 |
| National Institute of Allergy and Infectious Diseases | 2,803 | 18,138 | 6 | 67 | 2.39 |
| **Top 5 by citations in DAG** | | | | | |
| National Institute of Allergy and Infectious Diseases | 2,803 | 18,138 | 6 | 67 | 2.39 |
| National Cancer Institute | 1,366 | 9,099 | 7 | 23 | 1.68 |
| National Institute of General Medical Sciences | 753 | 5,208 | 7 | 11 | 1.46 |
| National Heart Lung and Blood Institute | 451 | 3,193 | 7 | 8 | 1.77 |
| National Institute of Diabetes and Digestive and Kidney Diseases | 382 | 2,555 | 7 | 0 | 0.00 |
**Order in Innovation**
Martin Ho1,2§, Henry CW Price3,4§, Tim S Evans3,4‡, Eoin O'Sullivan1,2‡
1 Centre for Science Technology & Innovation Policy, University of Cambridge, Cambridge CB3 0HU, United Kingdom
2 Institute for Manufacturing, Department of Engineering, University of Cambridge, Cambridge CB3 0HU, United Kingdom
3 Centre for Complexity Science, Imperial College London, London SW7 2AZ, United Kingdom
4 Theoretical Physics group, Department of Physics, Imperial College London, London SW7 2AZ, United Kingdom
§These authors contributed equally to this work.
‡These authors also contributed equally to this work.
- Corresponding author: [email protected]
Abstract
Is calendar time the true clock of innovation? By combining complexity science with innovation economics and using vaccine datasets containing over three million citations and eight regulatory authorisations, we discover that calendar time and network order describe innovation progress at varying accuracy. First, we present a method to establish a mathematical link between technological evolution and complex networks. The result is a path of events that narrates innovation bottlenecks. Next, we quantify the position and proximity of documents to these innovation paths and find that research, by and large, proceeds from basic research, applied research, development, to commercialisation. By extension, we are able to causally quantify the participation of innovation funders. When it comes to vaccine innovation, diffusion-oriented entities are preoccupied with basic, later-stage research; biopharmaceuticals tend to participate in applied development activities and clinical trials at the later stage; while mission-oriented entities tend to initiate early-stage research. Future innovation programs and funding allocations would benefit from better understanding innovation orders.
1 Introduction: Why do we need to understand the order of innovations?
We all know time flows linearly in one direction. Innovation, on the other hand, is historically one-directional but nonlinear [1]. Therefore, studies that present innovation events on a linear calendar timescale alone cannot represent the causality, importance, and convergence of innovation intermediaries. In multi-step reactions in chemistry, reactants do not jump straight to products; there are intermediaries with different activation energies and always a rate-determining step that sets the maximum rate of the overall reaction. Chemists often catalyse the rate-determining step to speed up the overall reaction. Likewise, an innovation process contains intermediary outputs, and we further propose that there are bottlenecks whose catalysis would accelerate the overall innovation process. Accordingly, we investigate:
In what order did individual technological breakthroughs occur to realise innovation outcomes? And, by extension, in what order did innovating entities support the most rate-limiting innovations along an order? To answer these questions, we prototype the use of a multilayer directed acyclic graph (DAG) to order scientific and technological precursors of innovation breakthroughs.
We propose methods to understand innovation order because contemporary analytical regimes may fall short of systematic causal explanations of complex, multi-phase innovation. Evolutionary economics acknowledges the existence of innovation intermediaries: science and technology are subject to evolution, which, in turn, leads to constant changes in the macroeconomy [2]. (Footnote 1: Evolutionary economists state that it is technologies and organisations that evolve; price, quantities, and GDPs are downstream changes.) Sociotechnical transition describes the emergence of new technologies via evolutionary intermediaries and their incorporation into society. However, this approach has so far been limited to qualitative case studies of longitudinal innovation because there is a tradeoff between the scope of the innovation being analysed and the depth of causal explanation. As a result, sociotechnical transition theorists call for techniques that can analyse the “heterogeneity and multi-dimensionality of large scale sociotechnical systems” [3].
We argue that “potential outcome reasoning” in natural experiments, commonly used in policy evaluation, is not the most appropriate way of establishing causality in evolutionary economics. Rooted in clinical statistics, potential outcome reasoning estimates the causal effect of a treatment variable on an outcome variable by randomly assigning subjects into treatment and control groups so that, on average, the treatment and control groups differ only by their treatment status and are independent of all other factors (see Section D.1 for details). (Footnote 2: As permitted by the central limit theorem and the law of large numbers.) In innovation ecosystems, however, it is unlikely that (quasi-)random assignment can be achieved because ideas respect no barriers: there is no such thing as a “natural control” in innovation due to the low marginal cost of adopting knowledge. Another requirement of a natural experiment is that treatment, outcome, and all confounding variables be accounted for unless there is an appropriate instrumental variable. Not only is it impractical to regress all relevant variables in an innovation system; often, the representativeness of innovation variables is also sensitive to time. To illustrate, a drug in a Phase I trial is assessed on toxicity, whereas the same drug at Phase III is assessed on efficacy variables. Problems with potential outcome reasoning are not unique to innovation economics: epidemiologists, who chiefly use randomised controlled trials, struggle with different states of the same variable, the specificity of variables (e.g. what variable can exhaustively denote innovation?), the context dependence of causality, and the use of different types of evidence to arrive at one overall verdict [4].
Interestingly, graph theory is used alongside, rather than as an alternative to, randomised controlled trials in epidemiology [5]. However, epidemiologists’ use of DAGs is limited to non-parametric visual representations of variables in randomised controlled trials and a priori exploration of causal variables [6, 7]. Citation networks are a prime example of a DAG being analytically applied to causally order innovation events [8, 9]. Their ability to support causal inference is, nevertheless, marred by incomplete data. Firstly, citation networks typically rely on one type of data, either patents only or journal publications only, meaning not all technological maturities are represented. Secondly, citation datasets are typically generated using keyword searches only [10, 11]. Searching for patents with a keyword such as “biofuel”, for example, would not result in a citation network that captures early scientific advances in, say, genetic engineering, because the future applications of these advances were unknown at the time. Thirdly, in contrast to natural experiments, which are restrictive in the dataset being used, it is often a challenge to delimit a specific dataset from which to construct a citation network. For instance, some citation networks lose specificity by crudely clustering millions of patents ever filed in a country by patent classification codes.
Because a DAG contains a causally ordered chain of events, we can, provided the network data is sufficient and relevant, directly observe the causal path from input A to outcome B along with all causal intermediaries. Network science describes and analyses complex systems through abstraction, with nodes representing entities and edges representing a connection between a node pair. The network approach has been successful in deducing properties of real networks, such as the fat-tailed degree distribution (power law) and community behaviours (e.g. centrality) of entities [12, 13]. However, attempts to deduce causal relations in multilayer networks remain scarce, even though this is fundamental to understanding complex systems such as innovation.
This paper shows how to represent innovation order and demonstrates how this can better our understanding of innovation from an evolutionary perspective. Section 2 develops an original method of applying graph theory to innovation; Section 2.1 introduces the empirical data; Section 3 interprets results; Section 4 discusses the use of longest path to order innovation; Section 5 concludes.
2 Methods
In this section, we look at how we move from raw data to produce the citation networks which encode the multiplicity of innovation phases. One important feature is our integration of multiple sources of data. Another key difference to earlier work is that the direction of time in a citation network is fundamental to our approach. We give a formal set of definitions in the Supplementary Material.
2.1 Data
We create a multilayer citation network which is a directed acyclic graph (DAG) in order to observe innovation patterns and to test the relationship between critical scheduling events and documents on or close to the longest paths in the network, as discussed later in Section 4.
Data on medical innovations is an excellent source, not only because the concept of translation is most established in medical research, but also because new therapeutics are required by law to be reported and registered. In particular, we focus on vaccine approvals where there is excellent data available.
Vaccination confers long-lasting and protective immunity by presenting antigens of interest to elicit specific antibody production in recipients. Historically, vaccines present antigens through inactivated or attenuated versions of whole pathogens or of their protein subunits. Beyond efficacy, to prevent the spread of infectious agents, vaccines are administered to a large proportion of a population. Hence, vaccines must be safe and inexpensive. Because vaccines serve as a rapid countermeasure to pathogenic outbreaks, other bottlenecks for vaccine platforms are manufacturability and ease of deployment. We outline the four vaccine platforms covered in this analysis and some technical events we expect to recover from the network in Table 1. In Appendix B we give further details of the data sources used and the innovation events we expect.
Each network we create starts from a single document approving a particular vaccine, and this is the only source node in that DAG.
We obtain our data on clinical approvals from the US Food and Drug Administration (FDA), European Medicines Agency (EMA), and the UK Medicines and Healthcare products Regulatory Agency (MHRA). When a product is authorised by any of the three entities, we use the first authorised date to represent novelty and scan for all available references from all three agencies’ authorisations [14, 15, 16, 17, 18, 19, 20, 21].
2.2 Innovation network
We start by defining the key properties of the innovation networks used in our work. Formally, a network (or graph) is a set of nodes, and pairs of nodes can be connected by an edge. In our networks, each node represents a single document which is one of four types: an innovation outcome represented by regulatory authorisation, a clinical trial, a patent, or an academic publication. So, our networks are examples of what are called multilayer networks, for example see [33], as each type of node can be visualised as placed on a different layer, see Fig. 1. Our edges, written as $(u, v)$, are citations from one node $u$ to another node $v$, so our networks are examples of citation networks. Note that edges in citation networks have a sense of direction, as $(u, v)$ represents an entry listing document $v$ in the bibliography of a document $u$, not the other way round. So, citation networks are examples of what are known as directed networks.
Citation networks also have a sense of order, since a document cannot cite a later document; so for an edge $(u, v)$, document $v$ must have been published before document $u$. (Footnote 3: Our data gives a single date for each document, but in reality one can associate several different ‘publication’ dates: application and grant dates for patents, the date a journal publication first appeared online as opposed to the official publication date written in its text [34], etc. So, the data used to build a citation network can have edges that go from an earlier to a later document, at least according to any single date we assign to each document, something seen in any work with citation networks such as [35]. To portray novelty consistently, we use the first published date for publications, priority date for patents, and start date for clinical trials.) As a result, there should be no cycles (loops) in our networks. That is, if we move from one node to a neighbour, respecting the direction of the edge, and then repeat these steps as often as we want (this defines what is called a ‘walk’ in a network [33]), we will never visit the same node twice. Thus, our citation networks are examples of what are called directed acyclic graphs (DAG). The direction and the lack of cycles in a DAG are a direct result of a sense of order that is present in all DAGs. In a citation network, this order is the implicit arrow of time. This order in a DAG leads to several special properties, which we exploit in our work.
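The ordering requirement can be checked directly against the single date assigned to each document. The following is a minimal sketch, not the authors' code: the function name `edges_against_time`, the document ids, and the dates are all hypothetical, and an edge is written as a (citing, cited) pair.

```python
from datetime import date


def edges_against_time(edges, published):
    """Return citation edges whose cited document is dated after the citing one."""
    return [(u, v) for u, v in edges if published[v] > published[u]]


# Hypothetical documents with one assigned date each.
published = {
    "paper_2019": date(2019, 5, 1),
    "patent_2018": date(2018, 3, 12),
    "paper_2017": date(2017, 11, 30),
}

# Each edge is (citing document, cited document).
edges = [
    ("paper_2019", "patent_2018"),
    ("patent_2018", "paper_2017"),
    ("paper_2017", "paper_2019"),  # cites a later document: violates the order
]

suspect = edges_against_time(edges, published)
```

Edges flagged this way are the ones that can close a cycle, which is why raw citation data needs cleaning before it can be treated as a DAG.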
In practice, we find that our data initially gives networks where 0.07% of all edges are part of a cycle, for example, due to authors citing each other's papers during journal submission or due to mislabelling. We always remove these cycles (as described below) to ensure that the networks we analyse are always DAGs.
2.3 Growing an innovation network
The networks we use all start from a single seed node, known as the source node, representing the regulatory marketing authorisation for one vaccine. This is because the regulatory decision represents the first time a therapeutic product is marketed and thus marks an innovation breakthrough. This regulatory authorisation node will be the newest node in each network we consider and so the only node in that network with no citations, that is, no incoming edges. This is the only node in our initial set of nodes, denoted $V_0$.
In the second step, we scan the regulatory authorisation for any publications, clinical trials, and patents. We denote these documents as part of the set of nodes at a ‘depth’ of one from the source vertex. An edge is added from the source regulatory document to each of these document nodes at depth one.
In addition, since regulatory documents do not normally contain patent ids, we also locate precise patents associated with vaccines through supplementary information on drug manufacturer inserts and websites. We add a node and a link from the regulatory document to each associated patent.
Once we have all the patents associated with the regulatory document $s$, directly and indirectly through drug manufacturer inserts and websites and through the Intervention sections of clinical trial documents, we finish by looking at patent families. Each patent $p$ we have found is part of a patent family, something mentioned in the patent information, giving us further patents, say $p_\alpha$, where the label $\alpha$ identifies a patent in the same family as $p$. However, we do not add new nodes for each of these patents $p_\alpha$. We do find all references from any associated patent $p_\alpha$ to any further document, say $d$. However, all of these references are represented as links from the single node $p$ to document $d$, a node in the second-level set $V_2$; that is, we add a link $(p, d)$. In some sense, the patent nodes at this first level represent all patents in the same family. We do not do this for patents at higher levels.
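The family-collapsing step can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' implementation: `collapse_family_references`, the mapping `representative_of`, and the patent and document ids are hypothetical.

```python
def collapse_family_references(representative_of, references):
    """Attribute every reference made by a family member to one representative node.

    representative_of: patent id -> id of the patent node kept in the network
    references: patent id -> list of cited document ids
    Returns a set of (citing node, cited document) edges.
    """
    edges = set()
    for patent, cited_docs in references.items():
        node = representative_of.get(patent, patent)
        for doc in cited_docs:
            if doc != node:  # never create a self-loop
                edges.add((node, doc))
    return edges


# Hypothetical family: three filings of one invention, with "P-US" kept as the node.
representative_of = {"P-US": "P-US", "P-EP": "P-US", "P-JP": "P-US"}
references = {
    "P-US": ["paper_A"],
    "P-EP": ["paper_A", "paper_B"],
    "P-JP": ["paper_C"],
}

family_edges = collapse_family_references(representative_of, references)
```

The result keeps one patent node while retaining every distinct reference found anywhere in its family.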
Another way we expand the patents in the early parts of our citation network is that we look at the clinical trials $t$ in the regulatory document $s$, where there is already a link $(s, t)$. We then search patent databases for therapeutic names recorded in the “Intervention” sections of each clinical trial document. New patents $p$ found this way are also added as nodes at the next level, part of $V_2$, with links $(t, p)$. We also perform the same search for documents cited by patents in the same patent family as $p$, as noted above.
Thirdly, we perform snowball sampling. That is, at the $n$-th step of sampling, we have a set of documents $V_n$ which form the nodes most recently added to our DAG. We start the process from the set of documents $V_1$, those one step away from the regulatory approval document. We follow the references in these documents to create new edges, say edge $(u, v)$ from document $u$ in $V_n$ to a document $v$ listed in the bibliography of $u$. If we have not encountered document $v$ so far, then we add $v$ to the next set of documents to be considered, namely $V_{n+1}$. This process produces an exponential growth in the number of documents, so we have to terminate this process at some stage. We choose to do this after three steps because of computational limitations. This leaves us with vertices defined by the four distinct sets $V_0$, $V_1$, $V_2$ and $V_3$. We also have all the edges defined by the references of the documents in the first three sets. The final step of this part of our process is to look at the references given in the last set of documents found, $V_3$. If any document $u$ in this last set refers to a document $v$ we have already added to our network, then we add an edge $(u, v)$. Should this reference be to a document not currently in our data set, we do not add this document as a new node to our network, and neither do we add an edge to it.
We limit the growth of the network to three iterations for two reasons: (i) the graphs would have grown exponentially in the first iteration of the algorithm, such that any further network growth will capture innovations so distant in the past that it would be unclear whether we can attribute them to the innovation outcome; and (ii) such a growth will result in an unmanageable graph due to computational and data-sourcing limitations. At the end of the process, our network will have several nodes with no outgoing edges, and these nodes are known as sink nodes.
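The growth procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `references` lookup stands in for queries to the citation databases, the function name `snowball` and all document ids are hypothetical, and the patent-specific expansion steps are left out.

```python
def snowball(source, references, max_steps=3):
    """Grow a citation network outwards from `source`.

    references: document id -> list of cited document ids (hypothetical lookup)
    Returns (layers, edges): layers[n] is the set of documents first reached at
    step n, and edges is a set of (citing, cited) pairs.
    """
    layers = [{source}]
    seen = {source}
    edges = set()
    for _ in range(max_steps):
        next_layer = set()
        for doc in layers[-1]:
            for cited in references.get(doc, []):
                edges.add((doc, cited))
                if cited not in seen:
                    seen.add(cited)
                    next_layer.add(cited)
        layers.append(next_layer)
    # Final pass: documents in the last layer only contribute edges that point
    # to documents already in the network; no new nodes are created.
    for doc in layers[-1]:
        for cited in references.get(doc, []):
            if cited in seen:
                edges.add((doc, cited))
    return layers, edges


# Hypothetical reference lists; "R" plays the role of the regulatory authorisation.
references = {
    "R": ["T1", "P1"],
    "T1": ["A"],
    "P1": ["A", "B"],
    "A": ["C"],
    "B": ["E"],
    "C": ["E", "D"],  # "E" is already known, "D" is not
}

layers, edges = snowball("R", references)
```

In this toy run the last layer is `{"C", "E"}`; the reference from `C` to `E` is kept because `E` is already in the network, while the reference to the unseen document `D` is dropped.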
In doing this snowball sampling, we have access to the citation data on three types of document: clinical trials from ClinicalTrials.gov; patents from Lens.org [36]; and publication data from Dimensions.ai [37]. The exhaustiveness of the first two data sources rests on the fact that all drugs and biological products investigated under a regulatory investigational new drug application must be registered on ClinicalTrials.gov, and that the FDA maintains a database of all drugs and biological products it has approved. Similar requirements exist for the European Medicines Agency, but we do not include any results based on approvals from.
As with any data on citations, our results will be incomplete. References in the original documents may not be captured in our data sets for a number of reasons, such as errors in the original document, incorrect transcription from the primary source to our electronic sources for citations, or simply some documents are not in our databases such as press releases or preprints (‘grey’ literature). The legal framework behind vaccines means that our data on regulatory approval, clinical trials and patents is likely to be better than journal citations, but no data is perfect. We have not attempted to study the effect of errors in our data, rather relying on the large scale of our data to provide some statistical safety net.
Our process up to this point has provided a directed network where the nodes always have two additional labels. First, we record which of the four types of documents a node represents. The second node label gives a single date which we call the publication date: the priority date for patents, the start date for clinical trials, and the official publication date for an academic article. Additional node information about the funding of the research reported in any document is discussed later.
However, we also require that our network is acyclic. While in principle one document only ever refers to older documents, which guarantees an acyclic network, in practice there are always cycles in raw citation networks. These arise because documents are never created in a single moment of time. In practice, documents have a range of dates, from the first formal submission of a document (such as the application to hold a clinical trial, filing of a patent, depositing a paper on a preprint server) through to a final version of a document (the end of a clinical trial, the award of a patent or the physical publication date assigned to a journal article). For journal articles, there are many possible dates we could use [34] but they usually differ by the smallest amount, typically less than a year. For clinical trials and patents, the range of dates associated with these documents can be over several years. Across the eight networks studied, the mean and standard deviation of the number of cycles per edge was . Our final step is to remove one edge from every cycle to produce a true directed acyclic graph. (Footnote 4: In practice, we use the find_cycle() function in the NetworkX package [38]. This is used iteratively to remove all cycles. When a cycle is found, the first edge of the cycle is removed from the graph and the function is run again until no cycles can be found. Alternative approaches can be used to produce an acyclic graph [35].)
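The iterative cycle removal described in the footnote can be sketched with NetworkX as follows. This is a minimal illustration rather than the authors' code: the helper name `remove_cycles` and the toy graph are hypothetical, while `nx.find_cycle` and the `nx.NetworkXNoCycle` exception are part of NetworkX.

```python
import networkx as nx


def remove_cycles(graph):
    """Return an acyclic copy of `graph`, deleting the first edge of each cycle found."""
    dag = graph.copy()
    while True:
        try:
            cycle = nx.find_cycle(dag)
        except nx.NetworkXNoCycle:
            return dag
        # find_cycle returns the cycle as a list of edges; drop the first one.
        u, v = cycle[0][:2]
        dag.remove_edge(u, v)


# Toy citation graph with one cycle: A -> B -> C -> A.
raw = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
dag = remove_cycles(raw)
```

Which edge counts as the "first" depends on the traversal order inside `find_cycle`, so different node orderings can yield different, equally acyclic, results.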
2.4 Longest path in a citation network
A path in a network is a sequence of distinct nodes, $(v_0, v_1, \ldots, v_\ell)$, where each consecutive pair of nodes forms an edge, so $(v_i, v_{i+1})$ is an edge, from $v_i$ to $v_{i+1}$ if the edges are directed as here. In our case, paths are always moving backwards in time, as each document in a path can only cite an older document as the next step on a path. We will define the length of a path, $\ell$, to be the number of edges in the path (one less than the number of nodes). In particular, we will focus on the longest paths, not on the shortest paths normally encountered in network science, e.g. as in [33]. It is one of the special properties of a DAG that the longest paths are typically of a reasonable length, making them useful measures. See Section 4 for a more detailed discussion of why we work with the longest path. We will define the distance between pairs of nodes in a DAG to be equal to the length of the longest path between two nodes.
A key assumption in our work is that the most important steps for an innovation lie on or close to the longest path in an innovation citation network. We argue that this is because knowledge is built up incrementally. Even when there are leaps in development, they are built on the success or failure of the most recent attempts to develop science. Our longest paths contain many documents that made a small contribution to the final vaccine but we suggest that all the key documents will be there. By way of contrast, had we used the shortest paths to study our innovation networks, the most widely used path in Network Science [33], the shortest paths do not contain cumulative information of knowledge inheritance. The shortest path would miss information because a document may cite important but old documents and so the shortest path will miss more recent critical developments, see Fig. 1. For a longer discussion of our choice and possible alternatives, including the differences between the longest path in a network and critical path in a schedule, see the discussion in Section 4.
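The contrast between shortest and longest paths can be seen on a toy DAG. The sketch below is illustrative only: the node names are hypothetical, edges point from a citing document to a cited one, and the longest path is found by memoised recursion, which is adequate for small graphs.

```python
from collections import deque
from functools import lru_cache

# Hypothetical DAG: the approval cites a trial and, directly, an old paper.
succs = {
    "approval": ["trial", "old_paper"],
    "trial": ["patent"],
    "patent": ["old_paper"],
    "old_paper": [],
}


def shortest_path(succs, start, goal):
    """Breadth-first search for a path with the fewest edges."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in succs[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return []


def longest_path(succs, start, goal):
    """Longest path in a DAG from start to goal (empty list if unreachable)."""

    @lru_cache(maxsize=None)
    def best(node):
        if node == goal:
            return (node,)
        candidates = [best(nxt) for nxt in succs[node]]
        candidates = [c for c in candidates if c]
        if not candidates:
            return ()
        return (node,) + max(candidates, key=len)

    return list(best(start))


short = shortest_path(succs, "approval", "old_paper")
long_ = longest_path(succs, "approval", "old_paper")
```

Here the shortest path jumps straight from the approval to the old paper, whereas the longest path passes through the trial and the patent, the intermediate steps that the shortest path skips.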
In order to study the longest paths, it is convenient to define two standard properties of nodes in a DAG. The height $h(v)$ of a node $v$ is the maximum distance from the source node (the seed authorisation document) to the node $v$, while the depth $d(v)$ is the maximum distance from node $v$ to any of the sink nodes. The height of the DAG is equal to the largest possible value of the height, $H = \max_v h(v)$. The height of a DAG is also always equal to the largest depth of any node, which in our case is the depth of the seed node, the regulatory approval node and the only source node in our DAGs.
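Height and depth can both be computed in a single pass each over a topological ordering of the DAG. The following is a minimal sketch using only the standard library; the function name `heights_and_depths` and the toy edges are hypothetical, and edges are (citing, cited) pairs.

```python
from graphlib import TopologicalSorter


def heights_and_depths(edges):
    """Longest-path height (from the source) and depth (to a sink) for each node."""
    preds, succs = {}, {}
    for u, v in edges:
        succs.setdefault(u, set()).add(v)
        succs.setdefault(v, set())
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    # static_order yields every node after all of its predecessors,
    # so citing documents come before the documents they cite.
    order = list(TopologicalSorter(preds).static_order())
    height = {node: 0 for node in order}
    for node in order:
        for cited in succs[node]:
            height[cited] = max(height[cited], height[node] + 1)
    depth = {node: 0 for node in order}
    for node in reversed(order):
        for cited in succs[node]:
            depth[node] = max(depth[node], depth[cited] + 1)
    return height, depth


# Toy DAG with source "R": a long chain R -> A -> B -> C plus two shortcuts.
edges = [("R", "A"), ("A", "B"), ("B", "C"), ("R", "C"), ("R", "D"), ("D", "C")]
height, depth = heights_and_depths(edges)
```

In this example the source `R` has height 0 and depth 3, the sink `C` has height 3 and depth 0, and the side node `D` has height 1 and depth 1.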
It is important to note that because our distance is integer valued, there can be many longest paths between any two nodes, not just one. Further, while we argue that critical developments will lie on a longest path, this is not something we can prove rigorously and, in any case, we can expect data used to form our citation network to be imperfect. Therefore, it is extremely useful to be able to look at documents that are not on one of the longest paths to the source node but instead lie on a path from source to sink that is one or two steps shorter than the longest path in the DAG. That is, we will also consider documents that are close to a longest path. To quantify what we mean by ‘close’ in this context, we define criticality for a node $v$ as:

$$c(v) = H - h(v) - d(v), \qquad (1)$$

where $h(v)$ and $d(v)$ are the height and depth of node $v$ and $H$ is the height of the DAG.
Criticality takes integer values between zero and the largest possible value of height or depth, the height of the DAG $H$. Any node which lies on a longest path of the DAG will have zero criticality. Equally, nodes which lie on a path from the source node to a sink node which is $n$ steps shorter than the longest path of the DAG will have a criticality value of $n$. Thus, the criticality value of a node can be thought of as the distance of a node to one of the critical paths down which the key innovations flow. Applying (1) to Fig. 1, the nodes on the longest path have a criticality of $0$, whereas a node one step off the longest path has a criticality of 1.
In other words, for a given node in the innovation network, the node’s height from the source node and depth from a sink node are uniquely defined. The novelty of our analysis is that we derive the longest path in the network by computing the criticality from height and depth. The criticality values of nodes not only show which nodes lie on longest paths (nodes with zero criticality) but also identify associated nodes lying on “near-longest paths” (small values of criticality). It is therefore easy for us to find other critical innovations which may have been missed by any method based on a single path, cf. conventional main path analysis [39, 40], which always returns a single path (see Section D.2 for further discussion of main path analysis).
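As a worked example, criticality can be computed from precomputed heights and depths. The sketch assumes, following the description above, that criticality is the height of the DAG minus a node's height minus its depth; the function name and the toy values are hypothetical.

```python
def criticality(height, depth):
    """Criticality of each node: DAG height minus the node's height and depth.

    Assumption: this is the reading of Eq. (1) implied by the surrounding text.
    """
    dag_height = max(height.values())
    return {node: dag_height - height[node] - depth[node] for node in height}


# Toy values for a chain R -> A -> B -> C with a side branch R -> D -> C.
height = {"R": 0, "A": 1, "B": 2, "C": 3, "D": 1}
depth = {"R": 3, "A": 2, "B": 1, "C": 0, "D": 1}

crit = criticality(height, depth)
on_longest_path = sorted(node for node, c in crit.items() if c == 0)
near_longest_path = sorted(node for node, c in crit.items() if c == 1)
```

Nodes `R`, `A`, `B` and `C` sit on the longest path, while `D` lies on a source-to-sink path that is one step shorter and so has criticality 1.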
2.5 Measuring funder activity as a function of time
For each node, there is a possibility that grants and funders linked to the research are reported in the associated document. We also look for specific entities in the acknowledgements for increased coverage. On Dimensions, some publications, patents, and clinical trials are connected to grant nodes, providing additional details such as the value of the grant, associated funder, and funding period. (Footnote 5: However, we do not use monetary information in this study as we do not know how grants are split between several publications, patents, or trials.) When measuring the effect of funders, we look at nodes and their associated funders, either directly via document-funder edges or indirectly via document-grant-funder edges, at one citation step from the grant attached to that project.
This information on the relationship between documents recorded in our multilayer citation DAG and funders means that every node can be associated with a subset of funders. (Footnote 6: We could think of this as a new layer forming a bipartite network between document nodes and funder nodes. In our work, we only look at simple measures relating to funders, so such a network description of the funding landscape is unnecessary here. All the networks we discuss here are multilayer citation networks; no funding or grants are encoded in the network structures we analyse.) We can now look at the properties of those nodes linked to any one funder, such as height and depth, and use various summary statistics, such as the median document height, to understand the different roles played by different funders in the innovation process.
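A summary statistic such as the median document height per funder can be sketched as below. The funder names, document ids, and heights are made up for illustration, and `median_height_by_funder` is a hypothetical helper rather than part of the study's code.

```python
from collections import defaultdict
from statistics import median


def median_height_by_funder(height, funders_of):
    """Median height of the documents associated with each funder."""
    grouped = defaultdict(list)
    for node, funders in funders_of.items():
        for funder in funders:
            grouped[funder].append(height[node])
    return {funder: median(values) for funder, values in grouped.items()}


# Hypothetical heights (distance from the authorisation) and funder links.
height = {"paper_1": 5, "paper_2": 7, "patent_1": 3, "trial_1": 1, "trial_2": 2}
funders_of = {
    "paper_1": {"Funder A"},
    "paper_2": {"Funder A"},
    "patent_1": {"Funder A", "Funder B"},
    "trial_1": {"Funder B"},
    "trial_2": {"Funder B"},
}

medians = median_height_by_funder(height, funders_of)
```

A larger median height indicates that a funder's documents sit further from the authorisation, that is, earlier in the innovation order.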
3 Results
3.1 Descriptive statistics
The eight citation networks we study contain a total of 569,660 nodes and 4,384,502 edges as shown in Table 2. What is interesting is that the two vaccine platforms which were commercialised after 2020 contain roughly two-fold more publications and five-fold more patents than the four new vaccines using more established vaccine platforms. Similarly, the citation networks of the two new vaccine platforms contain more edges than those of the two older platforms. More remarkable is the fact that the newer vaccine platforms contain twice the proportion of inter-layer edges (5.7%) compared with the older vaccine platforms (2.9%). (Footnote 7: Inter-layer edges are edges that involve two node types, e.g. publication-to-patent edges, as opposed to a single node type, e.g. patent-to-patent edges.) This indicates that more translation is needed to commercialize the new vaccines.
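The proportion of inter-layer edges can be computed directly from the node types at the two ends of each edge. The following is a minimal sketch with hypothetical data; `interlayer_fraction` is an illustrative name, not a function from our code base.

```python
def interlayer_fraction(node_type, edges):
    """Fraction of edges whose two end nodes are of different types."""
    inter = sum(1 for u, v in edges if node_type[u] != node_type[v])
    return inter / len(edges)


# Hypothetical toy network; edges run from citing to cited document.
node_type = {
    "pub_1": "publication",
    "pub_2": "publication",
    "pat_1": "patent",
    "trial_1": "clinical trial",
}
edges = [
    ("pub_2", "pub_1"),    # publication-to-publication (intra-layer)
    ("pat_1", "pub_1"),    # patent-to-publication (inter-layer)
    ("pat_1", "pub_2"),    # patent-to-publication (inter-layer)
    ("trial_1", "pat_1"),  # trial-to-patent (inter-layer)
]
fraction = interlayer_fraction(node_type, edges)
```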
3.2 Critical innovation path narrates causality in innovation
Graphically, we follow Eq. 1 and plot depth as a function of height. We plot the data from the mRNA vaccine graph in Figure 2 to illustrate: the bottom left node represents regulatory authorization and the top right nodes represent the earliest nodes in the network. The diagonal represents nodes which lie on at least one longest path of the DAG, our critical innovation paths, while numerous sub-critical nodes populate the region above the diagonal. From the hue of the diagram, we also observe a cluster of non-critical innovations in the region with low height and low depth. We observe the same pattern in all eight DAGs as shown in Appendix C. We set out to inspect: (i) nodes that are critical, (ii) the order of critical nodes from oldest to newest, and (iii) nodes that are of lowest criticalities, to test the theoretical equivalence between critical schedule path and longest network path.
Looking at nodes whose criticality is strictly zero (i.e. most critical), in each DAG in Fig. 3 we see a mix of nodes representing publications, clinical trials, and regulatory authorisation. If we relax the criticality threshold to consider nodes whose criticality is below 19.5% of the maximum height, we begin to see many more publications, a few more clinical trials, and a few patents in this relaxed critical path region. The order of the critical path, moving from high to low height nodes, always proceeds from publications (intertwined with a much smaller number of patents when the 19.5% threshold is used), followed by phase 1, 2, and 3 clinical trials, before ending with the regulatory authorisation. This sequence generally proceeds from basic research (publications), applied research (patents), development (clinical trials), to commercialization (regulatory authorisations).
A closer look at the critical path nodes unveils a logical sequence of technical progression. For instance, the Moderna mRNA vaccine DAG has its longest paths formed by early attempts to apply mRNA as an influenza vaccine platform [42, 43], using a liposomal delivery system to enhance the expression kinetics of mRNA vaccines [44, 45, 46], methylation to enhance in vivo antigen expression [47], the phase 1-3 clinical trials of mRNA COVID vaccines (NCT04283461, NCT04796896, NCT04847050, NCT04470427), and finally the FDA emergency use authorisation letters [48], events that the scientific literature is well aware of [49, 50]. In addition, the longest path of the same DAG also identified critical discoveries that may have been overlooked: mRNA post-transcriptional modification mechanisms [51, 52, 53, 54, 55, 56] and early basic research about the potential to modify RNA to evade detection by toll-like receptors [57, 58, 59].
We are also interested in the identity of non-critical nodes. Having low criticality in a DAG does not mean the innovation is unimportant; it means events are not rate-limiting and can perhaps be parallelised. Empirically, in the BioNTech/Pfizer COVID vaccine DAG, for example, nearly all reviewed nodes with low criticality are clinical research about prevalence and risk factors for diseases non-specific to COVID. Low criticality events are likely non-critical to the approval of the vaccine by the regulatory agency and, in this example, were used to facilitate the design of clinical protocols.
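The relaxed selection and the ordering by height can be sketched as follows. The records and the helper `relaxed_critical_sequence` are hypothetical; only the 19.5% threshold is taken from the text, and we assume that "below" means strictly below.

```python
# Hypothetical records: node -> (node type, height, criticality).
records = {
    "paper_1": ("publication", 20, 0),
    "paper_2": ("publication", 15, 3),
    "patent_1": ("patent", 11, 2),
    "paper_3": ("publication", 9, 8),
    "trial_1": ("clinical trial", 3, 0),
    "approval": ("regulatory authorisation", 0, 0),
}
dag_height = 20


def relaxed_critical_sequence(records, dag_height, fraction=0.195):
    """Nodes with criticality below fraction * dag_height, ordered from high to low height."""
    threshold = fraction * dag_height
    selected = [node for node, (_, _, c) in records.items() if c < threshold]
    return sorted(selected, key=lambda node: records[node][1], reverse=True)


sequence = relaxed_critical_sequence(records, dag_height)
type_sequence = [records[node][0] for node in sequence]
```

Reading `type_sequence` from left to right gives the order of node types along the relaxed critical path region for this toy data.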
3.3 Calendar time against height reveals innovation speed
The order inherent in a DAG gives a natural ‘clock’ for the innovation process captured by our citation network. It is interesting to see how this network order compares against calendar time. To see this, we plotted the number of days between a document’s date and the final regulatory authorisation against the height of that document in Fig. 3. This shows that calendar date is strongly correlated with network order, but the relationship is non-linear. Broadly speaking, the nodes with the smallest number of calendar days at every height are nodes on the longest path (i.e. they are nodes with 0 criticality). Why do the slopes in Fig. 3 differ despite both describing the longest path? What is the difference between calendar days and depth? Time and network order in a citation network proceed in the same direction. This is because new documents can only cite older documents and, similarly, innovation is cumulative [60]. However, their units of progression differ: time proceeds in evenly spaced seconds or days, whereas network order proceeds in citation steps that are non-equidistant. (Footnote 8: An analogy is a clock where the ticks are spaced out differently.) The latter means that the time gap and frequency of citations can increase and decrease over the course of an innovation lifecycle. This is possibly due to the cumulativeness of knowledge, entries and exits, consumer demand, and innovation policy. The different time gaps are observed across all eight vaccine networks (Fig. C.2).
Visually, Fig. 3a suggests the publication date is rising at a constant rate for most critical nodes, but the slope increases for documents with a normalised height close to . We have tried to estimate the rate of change of publication date against height in Fig. 3b by smoothing the data for those nodes on a longest path. On small scales, the change in height with calendar time fluctuates as seen in Fig. 3b. On a larger scale, the trend overall shows that height and time are reasonably correlated. This could show that network order provides an alternative measure of innovation progress compared to calendar time (Section D.6).
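One simple way to estimate such a rate of change is sketched below: finite differences of calendar days with respect to height, followed by a centred moving average. This is an assumption made for illustration, not necessarily the smoothing used for Fig. 3b, and the data points and function names are hypothetical.

```python
def days_per_height_step(points):
    """points: (height, days before authorisation) for nodes on a longest path.

    Returns (height, change in days per unit height) between consecutive points.
    """
    pts = sorted(points)
    return [
        (h1, (d1 - d0) / (h1 - h0))
        for (h0, d0), (h1, d1) in zip(pts, pts[1:])
        if h1 != h0
    ]


def moving_average(values, window=3):
    """Centred moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical critical nodes; the negative step mimics a patent dated "too early".
points = [(0, 0), (1, 300), (2, 700), (3, 650), (4, 1900)]
rates = days_per_height_step(points)
smoothed = moving_average([rate for _, rate in rates])
```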
First, citations in the vaccine networks can point both forward (to the future) and backward (to the past) in calendar time, because the patent process often spans several years. We found that due to interactions between patent applicants and examiners during patent prosecution, the patent document may be updated with new references. We use the initial patent submission date as our patent publication date. A year or two into the patent process, a recent paper can be added to the application, one that was published after the patent was submitted. As a result, a patent may cite forward in time as well as the logically acceptable backwards in time. We could use the patent award date as our patent publication date, which would solve the problem with the example just given. However, we would then run into problems with documents that cite a patent that is not yet approved and is a critical part of the innovation process. This again illustrates why using the height of a node in our citation network can be a more consistent record of the logical order in the innovation process.
Second, the order of node types along the critical path in Fig. 3b shows a clear progression of publications (basic research) to patents (applied research) to clinical trials (development). (Footnote 9: If we consider non-zero criticality nodes, we start to see more overlaps between node types.)
Third, the rate of change fluctuates within each node type. The rates of change for critical patents and clinical trials fluctuate between 300 and -300 days, with the negative values indicating the problems of using a single publication date for patents, as these are revised over the several years it takes for a patent to be approved. On the other hand, it takes 50-1300 days for height to increase by one in the early critical journal publications, whereas more recent critical publications, those closer to the regulatory approval, take one year for a height increase of one, indicating an increasing rate of innovation. A plausible explanation is that, towards the late stage (few calendar days from regulatory approval), the purpose of innovation activities is better known and focused towards the vaccine; more complete knowledge about and greater participation in the vaccine may have led to an increasing frequency of critical innovations. In future studies, Fig. 3b could serve as a measurable interpretation of Utterback and Abernathy’s [61] industry lifecycle model, which hypothesises that the “rates of” product and process innovation over time are convex and concave respectively; as well as the linear innovation model [62], which prescribes that basic research, applied research, development, and production be carried out by different sets of actors one stage after another.
While the primary aim of this section is to propose new methods to measure the order and rate of innovation, we cannot help but observe some interesting differences across the vaccine data, see Fig. C.3 in the Appendix: it always takes less time to make critical progress at later innovation phases. Future studies may compare data across sectors to reposition science policy’s role in accelerating innovation.
3.4 Division of innovation labour is quantifiable via network height
Using the findings above, we demonstrate another real-world utility of innovation order. We portray the frequency of innovator funding as a function of network height to discern the innovation phases entities are supporting (see Section 2.5 for methodological details). Fig. 4 shows illustrative data from the Novavax COVID protein subunit vaccine where we show the top five funders by number of nodes funded, three mission-oriented innovation agencies, and the top five pharmaceuticals by number of nodes funded. (Footnote 10: Mission-oriented innovation agencies are entities that specifically fund frontier innovations to attain specific goals [63]. These entities are hypothesised to behave differently to diffusion-oriented agencies [64, 65], but this difference was awaiting quantification.) In the same table across the eight vaccines, we observe that the largest funders tend to occupy lower height, or late-stage; pharmaceuticals fund mostly late-stage documents; whereas mission-oriented innovation agencies are in the early- to mid-stage. Looking at calendar time, the median days of mission-oriented agencies and pharmaceuticals (2-19 years) are much closer to the regulatory approval than those of large funders (10-27 years). This may indicate the strategies and division of labour among innovation entities: larger funders fund basic and risk-averse research, mission-oriented agencies initiate high-risk research and translate discoveries to other funders, and pharmaceuticals play their obvious commercialization role at the late stage. However, we do not know whether this division of labour is deliberate or a result of their funding agenda. (Footnote 11: For instance, the mission statement of the National Institutes of Health is “to seek fundamental knowledge about the nature and behaviour of living systems…” whereas that of the Biomedical Advanced Research and Development Authority is to “develop and procure medical countermeasures…”.)
3.5 Criticality of innovation funders is quantifiable via longest path
We take the definition of “critical” from operations research to mean an event that delays the global project schedule when locally delayed (see Section 4 for details). With the criticality information from the DAG, we also compute the criticality of funders, measured by the number of critical nodes funded by a particular entity divided by the total number of nodes funded by the entity in the DAG. We compare this performance metric with the citations received by documents the funder funds within the network.
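The funder criticality measure defined above is a simple ratio, sketched here with hypothetical data. The function name `funder_criticality` and the inputs are illustrative; we assume a node counts as critical when its criticality is exactly zero.

```python
from collections import defaultdict


def funder_criticality(criticality, funders_of):
    """Share of each funder's funded nodes that have zero criticality."""
    funded = defaultdict(int)
    critical = defaultdict(int)
    for node, funders in funders_of.items():
        for funder in funders:
            funded[funder] += 1
            if criticality[node] == 0:
                critical[funder] += 1
    return {funder: critical[funder] / funded[funder] for funder in funded}


# Hypothetical node criticalities and document-to-funder associations.
criticality = {"a": 0, "b": 0, "c": 2, "d": 5}
funders_of = {"a": {"X"}, "b": {"X", "Y"}, "c": {"Y"}, "d": {"Y"}}
share = funder_criticality(criticality, funders_of)
```

Here funder X funds only critical nodes (share 1.0), whereas funder Y funds more nodes overall but only one of its three nodes is critical.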
Table 3 shows that funders who fund research that, in turn, leads to a large number of citations are not necessarily the funders of critical research. Entities that fund a high proportion of critical nodes are avid removers of innovation bottlenecks, who, in turn, allow progression along the technological trajectory. One caveat is that we do not know whether these critical funders deliberately removed innovation hurdles or unintentionally produced innovations that were applied by other entities to advance a technology. Another caveat is that some funders specialise in advancing basic science without thought of practical ends, while others specialise in translating innovation.
3.6 Validation
We validate the critical path by checking for documents that also appear in literature reviews published by subject-matter experts. (Footnote 12: This is unlike most main path analyses (Section D.2) that do not validate their results or only validate the keywords they use to generate the citation network.) Fig. 5 shows the height versus depth diagrams (as described in Fig. 2) for the Moderna and Pfizer/BioNTech vaccines but with additional annotations showing 352 documents found in three literature reviews on mRNA vaccines [66, 50, 66]. We found that the critical paths (the hypotenuses) are heavily populated by documents referenced in the literature reviews. Fig. 6 shows that documents found both in the Pfizer/BioNTech vaccine network and in the literature reviews have lower median criticalities of 0.142 [0.0885, 0.0230] (where we give the 25% and 75% quantiles in brackets) compared to documents found only in the former, where the median is 0.504 [0.274, 0.788]. Similarly, the figures for the Moderna network are 0.0710 [0.0328, 0.123] versus 0.0333 [0.169, 0.574]. Kolmogorov-Smirnov tests indicate that the criticalities of documents found in literature reviews are significantly different to those of documents not found in literature reviews (the p-values are always much less than ), validating the use of the critical path method to identify important innovation events.
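For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only this statistic, not the p-value, using the standard library; the criticality values are hypothetical and are not taken from our data.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so the supremum of their
    # difference is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical normalised criticalities.
in_reviews = [0.1, 0.2, 0.3, 0.4]
not_in_reviews = [0.25, 0.5, 0.6, 0.8]
statistic = ks_statistic(in_reviews, not_in_reviews)
```

A statistic close to one indicates that the two samples barely overlap; turning the statistic into a p-value requires the Kolmogorov distribution, which statistical packages provide.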
4 Discussion
One of the most useful measures in network science is the length of the shortest path between two nodes as this is used in numerous situations as the distance between two nodes. (Footnote 13: The length of the shortest path between two nodes in a network satisfies all the criteria of what is formally defined as a ‘distance’ function in mathematics. Indeed, it also satisfies the mathematical criteria to be a ‘metric’ and the shortest paths are therefore ‘geodesics’.) In many cases, these shortest paths are of practical relevance as we often look for the quickest, shortest route between two objects (nodes) in a network. For instance, in a social network, where people are nodes and edges are the connections between friends, the shortest path often represents the quickest way to get information between people. It is the basis for the popular idea of the six degrees of separation [33]. As a result, this shortest-path measure of distance between nodes is the basis for many other fundamental measures in network science, such as centrality measures [33].
In most networks, the longest path between two nodes has little practical relevance. For instance, in a social network, the longest path between two people will usually be a path visiting almost everyone in the network once, and we can think of no practical use for the distance of such a path in that context. (Footnote 14: Paths which visit every node are sometimes of interest, e.g. in the classic travelling salesman problem. However, in such cases the length of the paths is defined differently, in terms of the length of time to travel the network, the sums of the travel times associated with each edge. In such cases, there are many paths passing through all nodes, and the problem is still to find the shortest of these paths when distance is measured in terms of the sum of the time taken to travel each link in the path. So, this is still searching for the shortest paths out of a set of options, but using a different measure for distance from the one we are discussing at this point in the text.) However, the order encoded in a DAG means that the longest path between two nodes in such networks is not especially long; the length of the longest path in a DAG is rarely similar to the number of edges, as it is in the social network example above. So, now the question is, which path should we use when analysing the flow of information in our multilayer innovation citation networks, the shortest or the longest paths?
At a qualitative level, we can see that the longest path is likely to be more interesting for citation networks. The oldest papers we cite in our bibliography here are over sixty years old. In preparing this paper, it is likely that we did not learn much of direct relevance to the current paper by reading such papers. Such old papers are classics, but their influence is indirect, felt in our present work via more recent publications which apply these classic concepts in a modern context with modern terminology and notation. Conversely, it seems likely that the most recent papers give a much more powerful stimulus to authors.
Most papers in the bibliography of a journal article are recent, and the number of references decreases exponentially with the time difference between the citing document and the cited document once that age difference is a couple of years or more [68, 69, 70, 71]. Further support comes from studies which show that the text for over 70% of references in the bibliography of a journal article may have been copied from publications of intermediate age, suggesting that a large fraction of these older articles are not read when a new article is written [72, 73, 74, 35]. That is consistent with the idea that the ideas in these older texts were learnt from intermediary texts. In the case of innovation, the longest paths in a DAG embed the chronological and causal sequence of key scientific and technological advances contributing to a technological outcome.
The use of longest paths, not shortest paths, to analyse DAGs is common in other areas. The critical path method is used to schedule jobs in a project, such as the Manhattan Project [75, 76, 77] or independent parts of a numerical simulation running on multiple processors. In this method, the DAG captures the dependency (the edges) of one job (one node) on another. When an innovation network is considered as a scheduling DAG, the length of a path is the sum of the time needed to complete each job on the path (so not simply the number of edges in the path). The aim is to find the “critical path”, which is the path that sets the least time needed to complete the project. The critical path is set by the longest path in the scheduling DAG. (Footnote 15: The critical path is of greater relevance to mission-oriented innovation programs than diffusion-oriented ones [63]. Innovation missions involve organizing multiple innovation projects and programs, all with intermediate outputs, to attain specific goals and would benefit from finding the minimum time to attain an innovation goal.)
Badiru [77] defines three important aspects of the critical path method that we test using larger and more complex innovation datasets:
1. An activity is considered critical if changing the start or finish time of the activity will affect the overall project schedule.
2. The series of critical activities connecting the start and end points of a project is known as the critical path. Logically, the critical path “turns out to be the longest path in [a] network”.
3. A delay in any critical activity delays the entire project. Therefore, the “sum of durations for critical activities represents the shortest possible time to complete the project”.
This means:
\[ \text{Critical schedule path} \equiv \text{longest network path} \equiv \text{shortest time path} \]
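The three aspects above can be illustrated on a small scheduling DAG. The sketch below is a textbook forward and backward pass, not code from Badiru [77]; the job names, durations, and the function `critical_path_method` are hypothetical. A job is critical when its slack (latest finish minus earliest finish) is zero.

```python
from collections import defaultdict


def critical_path_method(duration, depends_on):
    """Return the minimum project time and the slack of every job."""
    earliest_finish = {}

    def finish(job):
        # Forward pass: a job starts once all of its prerequisites have finished.
        if job not in earliest_finish:
            start = max((finish(p) for p in depends_on.get(job, [])), default=0)
            earliest_finish[job] = start + duration[job]
        return earliest_finish[job]

    project_time = max(finish(job) for job in duration)

    dependents = defaultdict(list)
    for job, prerequisites in depends_on.items():
        for p in prerequisites:
            dependents[p].append(job)

    latest_finish = {}

    def latest(job):
        # Backward pass: a job must finish before its dependents' latest starts.
        if job not in latest_finish:
            latest_finish[job] = min(
                (latest(s) - duration[s] for s in dependents[job]),
                default=project_time,
            )
        return latest_finish[job]

    slack = {job: latest(job) - finish(job) for job in duration}
    return project_time, slack


# Hypothetical project: durations in days and prerequisite jobs.
duration = {"design": 3, "build": 5, "docs": 1, "test": 2}
depends_on = {"build": ["design"], "docs": ["design"], "test": ["build", "docs"]}
project_time, slack = critical_path_method(duration, depends_on)
critical_jobs = [job for job in duration if slack[job] == 0]
```

In this example the critical jobs are design, build and test, whose durations sum to the project time of 10 days, while the documentation job can slip by four days without delaying the project.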
There is also a more formal basis for the use of longest paths in DAGs. The order in DAGs is often derived from the flow of time, so it makes most sense to look at embedding a DAG in space-time, not simply in a space. Technically, DAGs are best embedded in a Lorentzian space-time, such as Minkowski space used in special relativity, rather than a Riemannian space, such as Euclidean space used for school-level geometry and in most traditional data science methods. For simple models, it is possible to show that the longest path in the DAG is the best approximation to the geodesic in the space-time [78, 79, 80]. (Footnote 16: Analytical results are derived in Bollobás and Brightwell [78] and earlier papers cited therein. These results are applied in Brightwell and Gregory [79]. Numerical results are in Rideout and Wallden [80].) Geodesics represent paths of “least resistance”, the path a freely moving particle would follow in the space. So, by analogy, it makes sense to think of longest paths in our innovation citation networks as representing the easiest route for knowledge to flow between two documents (nodes) in our citation network. This analogy has been successfully tested in the context of citation networks to show that geometric concepts like the dimension of a DAG can be defined using Minkowski space-time [81] or how to embed a DAG in Minkowski space-time [82].
Unsurprisingly, the study of innovation through citation networks is not a new subject. The best known approach is “main path analysis”, proposed by Hummon and Doreian [83] and, with variations, implemented in some popular analysis packages [39] (see Section D.2 for details). Our critique of the main path analysis method is that there is no formal basis for the method, unlike the work on the relationship between the longest path and geodesics in space-time models. The weights used in main path analysis are formed by looking at all paths from a set of initial nodes (such as the first publications in the data set, sometimes all publications) to a set of final destination nodes, each path equally weighted. However, we know some publications listed in a bibliography are more important than others, so it is unclear why giving all paths equal weight is a good way to capture the flow of innovation. While the popularity of main path analysis is one measure of a successful method, this popularity could be due to other factors such as the easy access to numerical implementations in widely used packages such as Pajek [39]. An alternative view of main path analysis can be found in a case where it fails to identify the backbone of the technological trajectory of the semiconductor industry [84]. Our interpretation is that the method may fail due to its reliance on a single path rather than on the full set of good paths for innovation.
5 Conclusion
When studying innovation, using a citation network DAG as opposed to other econometric approaches allows us to causally trace every intermediate step between a complete set of innovation inputs and an innovation outcome. Rather than regressing a limited set of variables, we exploit the fact that a citation network is a DAG, which encodes the geometry of innovation order. Theoretically, the longest paths in a DAG represent the critical causal routes which show the bottlenecks that constrain innovation; such a path is the most complex route to achieve because it is composed of the largest number of sequential components which cannot be parallelised. We hypothesise that the longest paths in these multilayer citation networks, where order matters, are the critical paths of innovation.
To verify the usefulness of the longest path in describing critical innovations, we prototyped two methods: (i) a reproducible way to construct a multilayer citation network representing basic research (publications), applied research (patents), development (clinical trials), and commercialisation (regulatory authorisations), and (ii) a simple way to quantify a document’s closeness to the longest path. These methods allow us to analyse events that turn out to be in the longest paths of eight vaccine citation networks. We were able to observe how basic discoveries in the lab accumulated and got absorbed by clinical researchers, who used these phenomenological observations to hypothesise what could work for a vaccine. Once a prototype vaccine product was available, the technological community further applied basic discoveries to optimise the product, which was eventually validated through clinical trials and approved for marketing.
As seen in Fig. 1, we can define the types of edge in terms of the labels of the nodes at the ends of the edge. This method of defining edges, which is useful in understanding innovation “translation”, will be explored in a separate study. The proposed method to quantify the criticality of innovation events empowers scientific understanding of technological change, particularly in: (1) comparing criticality patterns across industries and time spans, (2) comparing criticality patterns of funders, (3) measuring the linearity of innovation phases, (4) informing mission-oriented innovation planning, (5) attributing technological outcomes to events and entities, and (6) forecasting new innovation outcomes based on intermediate outputs.
By assembling a list of innovation events in the observed technological past, we inform the ingredients needed in similar future technological programs. Using this innovation ruler, we demonstrated the possibility to measure the rate of innovation, division of innovation labour, and criticality of innovation funders.
Appendix A Formal Definitions
Here we give the definitions of the network properties used in the main text using a formal language. The main aim is to arrive at a formal proof that nodes with zero criticality lie on a longest path in the DAG, Lemma A.4.
**Definitions A.1 (Network/Graph, Nodes, Edges).**
A network $G$ (here synonymous with graph) is a set of nodes $\mathcal{V}$ (also known as vertices) and a set of edges $\mathcal{E}$. An edge is denoted $(u,v)$ where $u,v \in \mathcal{V}$.
**Definitions A.2 (Layers).**
A multilayer network is a graph where nodes are connected by different types of edges (for example, see section 4.2 in [33]). In our case, the layers are defined by a partition of the nodes into different types so each layer contains nodes of only one type and each node exists on just one layer (a particular type of multilayer network). In our setting, we have an unweighted graph with four main types (Publication, Patent, Clinical trials, Regulatory approval) of nodes. The label of the node at each end of an edge, say $\alpha$ and $\beta$, leads to a partition of the edges into different types (the definition of a multilayer network), that is $\mathcal{E} = \bigcup_{\alpha,\beta} \mathcal{E}_{\alpha\beta}$, where $\mathcal{E}_{\alpha\beta}$ contains the edges from a node of type $\alpha$ to a node of type $\beta$.
**Definitions A.3 (Directed Graphs and Edges).**
A directed network has directed edges where the edge $(u,v)$, running from $u$ to $v$, is distinct from the edge $(v,u)$.
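**Definitions A.4 (Predecessors and In-Degree).**
The predecessors $\mathcal{N}^{-}(v)$ of a node $v$ are the set of nodes which are connected by an edge to $v$, so $\mathcal{N}^{-}(v) = \{ u \in \mathcal{V} \mid (u,v) \in \mathcal{E} \}$.
The in-degree of a node $v$ is the number of incoming edges, $k^{\mathrm{in}}(v) = |\mathcal{N}^{-}(v)|$.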
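**Definitions A.5 (Successors and Out-Degree).**
The successors $\mathcal{N}^{+}(v)$ of a node $v$ are the set of nodes which are connected by an edge from $v$, so $\mathcal{N}^{+}(v) = \{ w \in \mathcal{V} \mid (v,w) \in \mathcal{E} \}$.
The out-degree of a node $v$ is the number of outgoing edges, so $k^{\mathrm{out}}(v) = |\mathcal{N}^{+}(v)|$.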
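**Definitions A.6 (Walk, Path, Cycles and Path Length).**
A walk from node $u$ to node $v$, denoted $\mathcal{W}(u,v)$, is a sequence of nodes starting at $u$ and finishing with $v$ which are connected sequentially by edges
\[ \mathcal{W}(u,v) = (v_0 = u, v_1, \ldots, v_L = v), \qquad (v_i, v_{i+1}) \in \mathcal{E} \ \text{ for } i = 0, \ldots, L-1 . \tag{A.1} \]
The length of a walk is the number of nodes minus one, which is $L$ in (A.1).
A path is a walk where all the nodes are distinct, so $v_i = v_j$ iff $i = j$ in (A.1).
A cycle is a walk where the first and last node in $\mathcal{W}$ are identical, $v_0 = v_L$ in (A.1), but all other nodes are distinct.
The concatenation of two walks is where a walk from $u$ to $v$ is combined with a walk from $v$ to $w$ to produce a walk from $u$ to $w$ via $v$. We simply extend the sequence of nodes in the first walk with the nodes in the second walk, keeping the nodes in the same order, and we only include the common end/start node once. We denote this as $\mathcal{W}(u,v) \oplus \mathcal{W}(v,w)$ and formally we may define this as
\[ (u = v_0, \ldots, v_L = v) \oplus (v = w_0, w_1, \ldots, w_M = w) = (u = v_0, \ldots, v_L = v = w_0, w_1, \ldots, w_M = w) . \tag{A.2} \]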
**Definitions A.7 (Directed Acyclic Graph (DAG), Sources and Sinks).**
A Directed Acyclic Graph, a DAG, is a network/graph with directed edges containing no cycles.
A node $v$ with no incoming edges, that is $k^{\mathrm{in}}(v) = 0$, is known as a source node.
A node $v$ with no outgoing edges, that is $k^{\mathrm{out}}(v) = 0$, is known as a sink node.
**Definition A.8 (The Partial Order of a DAG).**
A DAG always defines a unique partial order on the set of nodes in which we have a binary relation between two nodes, denoted $u \preceq v$, if and only if there is a path from $u$ to $v$.
\[ u \preceq v \iff \text{there exists a path from } u \text{ to } v . \tag{A.3} \]
Note that under our definition of a path, every node is part of a trivial path of length zero so this binary relation is reflexive, i.e. $v \preceq v$, as required. Transitivity of a partial order comes from the fact that a concatenation of paths produces a path.
**Definition A.9 (Distance between nodes).**
The distance $D(u,v)$ from node $u$ to node $v$ in a DAG is defined to be the length of the longest path from $u$ to $v$, while the distance is left undefined if there is no such path.
\[ D(u,v) = \max \{\, \text{length}(P) \mid P \text{ is a path from } u \text{ to } v \,\} \quad \text{if } u \preceq v . \tag{A.4} \]
This definition of a ‘distance’ in (A.4) is not sufficient for this function to be a distance in the formal sense used in mathematics because (a) many pairs of nodes in a DAG may not be connected and this distance is undefined for such pairs and (b) this definition is not symmetric since if $D(u,v)$ is defined then $D(v,u)$ is not defined unless $u = v$. Note that both of these issues are easy to fix if a distance in the formal mathematical sense is required. (Footnote 17: For instance, consider where of (A.4) when . We then define when . Finally we set whenever and . This function is a formal mathematical distance function.)
Note that there are often many paths between two nodes $u$ and $v$ with the same length. This includes the longest paths, those with length equal to $D(u,v)$.
Also note that many other distance functions can be defined for pairs of nodes on a DAG, such as the length of the shortest path between two nodes. We will only use the distance defined here in terms of the length of the longest path.
**Definition A.10 (Reverse Triangle Identity).**
The distance between two nodes satisfies the reverse triangle identity
\[ D(u,w) \geq D(u,v) + D(v,w) \quad \text{if } u \preceq v \preceq w . \tag{A.5} \]
The reverse triangle identity has the opposite inequality from that found in the usual triangle inequality. The other difference is this identity does not apply for other permutations of the three sites $u$, $v$ and $w$.
Transitivity of the partial order guarantees that $u \preceq w$. This means that we can consider one of the longest paths from $u$ to $v$, say $P_1$, whose length gives the value of $D(u,v)$. Likewise, we know we have a path $P_2$ which is a longest path from $v$ to $w$ of length $D(v,w)$. If we concatenate these two paths, we produce a path from $u$ to $w$ which is of length $D(u,v) + D(v,w)$. The concatenated path $P_1 \oplus P_2$ is a path from $u$ to $w$, but it need not be a longest path between these two, even though it is made by combining two longest paths. Since the distance (A.4) is set by the largest path length, it means the length of the concatenated path sets a lower bound on the distance from $u$ to $w$. By definition, if there is another path from $u$ to $w$ and if such a path is longer than the concatenated path $P_1 \oplus P_2$, then this alternative path will set the distance $D(u,w)$ and that will be larger than $D(u,v) + D(v,w)$. Hence, the simple properties of paths and the maximum function in (A.4) give us our reverse triangle identity.
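A small worked example may help to see why this is an inequality rather than an equality. Writing $D(x,y)$ for the longest-path distance from $x$ to $y$, consider a hypothetical toy DAG, chosen purely for illustration, in which the route through $v$ is not the longest route from $u$ to $w$:

```latex
% Toy DAG with edges u -> v, v -> w, u -> a, a -> b, b -> w.
\begin{align*}
D(u,v) &= 1, \qquad D(v,w) = 1, \\
D(u,w) &= 3 \quad \text{(via } u \to a \to b \to w \text{)}, \\
D(u,w) &= 3 \;\geq\; D(u,v) + D(v,w) = 2 .
\end{align*}
```

Equality holds exactly when the intermediate node lies on a longest path between the two end nodes, which is not the case for $v$ here.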
Definitions A.11**.**
Height, Depth and Criticality
The height of a node in a DAG is the length of the longest path to that node from any node.
[TABLE]
The depth of a node in a DAG is the length of the longest path from that node to any node.
[TABLE]
The height of a DAG is the largest height of any node
[TABLE]
The criticality of a node in a DAG is the height of the DAG minus the height and minus the depth of that node
[TABLE]
The terminology ‘height’ and ‘depth’ are common when working with DAGs but ‘criticality’ is our own terminology for in (A.9).
If the distance function satisfies the properties of a formal mathematical distance then the height and depth are always defined and they will have a value of at least zero, . The only nodes with height zero are source nodes and the only nodes with depth zero are sink nodes.
With height and depth, we are effectively defining a distance between a node $v$ and the ‘beginning’, some source node $s$, and the ‘end’ of our DAG, some sink node $t$. We can formalise this in the following lemmas, which lead to our key result on the bounds for criticality of a node and on the interpretation of zero criticality value for nodes in a DAG.
**Lemma A.1** (Path leading to the height of a node).
The height $h(v)$ of a node $v$ is always the length of a longest path from some source node $s$ to $v$.
Proof.
Suppose this were not true and the height of $v$ is based on a path from some node $u$ to $v$, where $u$ is not a source node. Then there must be a node $u'$ preceding $u$, connected by an edge $(u', u)$. By definition $d(u',u) \geq 1$ and so by the reverse triangle identity (A.5) we know that $d(u',v) \geq d(u',u) + d(u,v)$, so $d(u',v) > d(u,v)$. Thus, the node $u$ does not give the longest path to $v$ and this node does not define the height of node $v$. We have a contradiction and so deduce that the height of the node must use a node with no predecessors, i.e. a source node. ∎
Using the same type of argument used to prove Lemma A.1 we can quickly show the following lemma.
**Lemma A.2** (Path leading to the depth of a node).
The depth $\delta(v)$ of a node $v$ comes from the longest path from $v$ to a sink node.
We show this using the same arguments used in Lemma A.1 but now applied to paths from node $v$ to some node $w$. If $w$ has a successor $w'$, i.e. $(w, w')$ is an edge, then node $w$ is not involved in defining the depth.
**Lemma A.3** (Path leading to the height of a DAG).
The height of a DAG, $H$, comes from the length of a longest path from a source to a sink node.
Proof.
We use the same arguments as we did for the last two lemmas. The only paths that cannot be extended at either end to produce longer paths are those running from a source node to a sink node. Thus the nodes with the largest heights are sink nodes. The height of the DAG comes from the largest of all heights (A.8), so the corresponding path must run to a sink node and, by Lemma A.1, run from a source node. ∎
A couple of corollaries follow from this. First, the height of the DAG is the length of a longest path anywhere in the graph. Second, the height of the DAG also equals the largest value of the depth, which must be attained by one (or more) of the source nodes.
Finally, we can put these ideas together to show the following lemma that is the basis for our analysis.
**Lemma A.4** (Zero criticality nodes).
A node $v$ with zero criticality, $c(v) = 0$, lies on a longest path in the DAG. Nodes with positive criticality do not lie on a longest path of the DAG.
Proof.
Consider a node $v$ with height $h(v)$ based on a longest path $P_{sv}$ from source $s$ to $v$ (from Lemma A.1) and depth $\delta(v)$ obtained from some longest path $P_{vt}$ from $v$ to sink node $t$ (from Lemma A.2).
The path from source $s$ via $v$ to sink node $t$ obtained by concatenating paths $P_{sv}$ and $P_{vt}$ has length $h(v) + \delta(v)$ by definition of concatenated paths in (A.2). This concatenated path need not be a longest path from $s$ to $t$, which has length $d(s,t)$, but in that case it must be shorter than this longest path, so $h(v) + \delta(v) < d(s,t)$. Equality, $h(v) + \delta(v) = d(s,t)$, can therefore only happen if the concatenated path is also a longest path from $s$ to $t$.
If this longest path from source $s$ to sink $t$ is one of the longest paths in the DAG, then $d(s,t) = H$ is the height of the DAG. In this case we have $h(v) + \delta(v) = H$ and so $c(v) = 0$, which is the first part of the lemma.
The converse of this statement, the second part of the lemma, follows from the following cases. Case (i) is where the concatenated path is a longest path from source $s$ to sink $t$ but is not one of the longest paths in the DAG, so $h(v) + \delta(v) = d(s,t) < H$. Case (ii) is where the concatenated path is not a longest path from source $s$ to sink $t$, so $h(v) + \delta(v) < d(s,t)$. In this second case, we know $d(s,t) \leq H$ regardless of the nature of the concatenated path, so again $h(v) + \delta(v) < H$. In either case $c(v) > 0$ and $v$ is not on a longest path in the DAG, proving the second part of the lemma. ∎
As a corollary of this lemma, we can see that the height of the DAG is also the largest possible value of the depth of any node.
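Lemma A.4 can be sanity-checked by brute force on a small graph. The sketch below, which assumes a hypothetical six-node DAG with unit edge weights, enumerates every source-to-sink path and compares the nodes on the longest ones with the nodes of zero criticality. All names are illustrative, and exhaustive enumeration is only practical for toy examples.

```python
# Hypothetical toy DAG: each node maps to the nodes its edges point to.
EDGES = {"s": ["a", "x"], "a": ["b"], "b": ["t"], "x": ["t", "y"], "y": [], "t": []}
PREDECESSORS = {v: [u for u in EDGES if v in EDGES[u]] for v in EDGES}


def longest(v, neighbours):
    """Longest path length reachable from v by following `neighbours`."""
    return max((longest(n, neighbours) + 1 for n in neighbours[v]), default=0)


HEIGHT = {v: longest(v, PREDECESSORS) for v in EDGES}  # walk back to sources
DEPTH = {v: longest(v, EDGES) for v in EDGES}          # walk forward to sinks
DAG_HEIGHT = max(HEIGHT.values())
ZERO_CRITICALITY = {v for v in EDGES if DAG_HEIGHT - HEIGHT[v] - DEPTH[v] == 0}


def paths_to_sinks(v):
    """Yield every path, as a list of nodes, from v to a sink node."""
    if not EDGES[v]:
        yield [v]
    for w in EDGES[v]:
        for rest in paths_to_sinks(w):
            yield [v] + rest


SOURCES = [v for v in EDGES if not PREDECESSORS[v]]
ALL_PATHS = [path for s in SOURCES for path in paths_to_sinks(s)]
LONGEST_LENGTH = max(len(path) - 1 for path in ALL_PATHS)
ON_LONGEST_PATH = {v for path in ALL_PATHS if len(path) - 1 == LONGEST_LENGTH for v in path}
```

For this graph the two sets coincide, and the brute-force longest path length equals the height of the DAG, as the corollaries of Lemma A.3 state.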
Appendix B Empirical data
In our context, the sources of our DAGs are the nodes representing the FDA approval of one of eight vaccines. We have chosen to work with vaccines produced by one of four different methods, and we outline these different approaches and associated vaccines in this section.
B.1 Viral vector (adenovirus vector) vaccine platform
Technical principle and bottlenecks. A viral vector vaccine (VVV) is a relatively novel vaccine platform that uses a virus to infect host cells with the genes of pathogens; the infected host cells then transcribe and translate the genes into antigens of the pathogen. Following vaccination, T cells and B cells respond both to the viral vector itself and, more importantly, to the antigen encoded by the viral vector. Viral vector vaccines can rapidly adapt from one pathogen to another because only the gene of interest needs to be exchanged. An adenoviral vector vaccine (AVV) is a subtype of viral vector vaccine that exploits the high transduction efficiency and pervasive tropism of adenoviruses to facilitate the expression of the target antigen. An adenoviral vector vaccine is produced by deleting the replication genes from an adenovirus serotype and inserting the genetic sequence of interest into the virus. This is followed by viral vector production in manufacturing cells and purification [85]. Historically, the efficacy of viral vector vaccines has been challenged by hosts’ immunity against the viral vector [86], and the surveyed vaccines should exhibit mechanisms to circumvent this issue: finding rare or non-human adenovirus serotypes and vectorising adenoviruses from non-human primates [87, 88].
Data. As of October 2022, four viral vector vaccines have been cleared by the FDA (the US Food and Drug Administration) for use in humans; we consider two of them, which represent the first uses of adenovirus as a vaccine vector: Zabdeno (against Ebola, developed by Janssen, first authorised for use in 2020) [17] and Vaxzevria (COVID-19, AstraZeneca, 2020) [16]. (Mvabea, the second dose of the Zabdeno/Mvabea regimen, uses modified vaccinia Ankara, a poxvirus, as vector.)
B.2 Nucleic acid (mRNA) vaccine platform
Technical principle and bottlenecks. Both DNA and RNA can elicit an immune response, but to date, only mRNA vaccines have been authorised by the FDA. Similar to adenoviral vector vaccines, nucleic acid vaccines are highly immunogenic, easily adapted, and readily manufactured compared to inactivated/attenuated vaccines. The two currently commercially available nucleic acid vaccines work by expressing antigens of interest via nucleoside-modified mRNA encapsulated in lipid nanoparticles (LNP) [89]. To arrive at the mRNA vaccines we have, innovators had to understand how mRNA elicits an immune response, how to control the amount of innate inflammatory reactions to therapeutic mRNA, how to deliver mRNA to transfect cells, and how to synthesise, purify, and buffer mRNA [49, 90, 50, 89].
Data. Similar to adenoviral vector vaccines, only two mRNA vaccines are authorised by the FDA at the time of writing: Spikevax (COVID-19, Moderna, 2020) [14] and Comirnaty (COVID-19, BioNTech/Pfizer, 2020) [15].
B.3 Whole pathogen (attenuated) vaccine platform
Technical principle and bottlenecks. The very first vaccine was an example of a whole pathogen vaccine (WPV) and the word “vaccine”, from the Latin “vaccinus”, comes from Jenner’s use of cow (“vacca”) pox to prevent smallpox. A typical whole pathogen vaccine contains microbes that are live but attenuated, meaning they are weakened strains, or inactivated, meaning they are killed or altered to prevent replication. For instance, the vaccines behind the eradication of polio contain inactivated poliovirus. Although whole pathogen vaccines are a two-century-old innovation, pathogenic attenuation or inactivation does not readily guarantee a viable vaccine due to immunogenicity, safety, and yield issues [91]. This has led newer vaccines to experiment with novel techniques such as the use of hydrogen peroxide and gamma irradiation as alternative means of inactivation [92, 93].
Data. Since whole pathogen vaccines are based on the oldest approach to vaccination, we use two examples as a baseline for comparisons with the novel mRNA and viral vector vaccine platforms. We choose two whole pathogen vaccines that were recently cleared by the FDA: Dengvaxia (Dengue, Sanofi, 2019) [18], which is one of the first vaccines against dengue, and Imvanex (Smallpox, Bavarian Nordic, 2013; also known as Jynneos) [19], which is a third-generation smallpox vaccine that is being used to control the monkeypox outbreak. Interestingly, the modified vaccinia Ankara used by Imvanex is the same virus that serves as the vector for the Mvabea vaccine discussed above.
B.4 Subunits (recombinant protein) vaccine platform
Technical principle and bottlenecks. Compared to a whole bacterium or virion, subunit vaccines contain one or more isolated constituents of a microorganism to stimulate a more targeted immune response. Subtypes of subunit vaccines make use of protein subunits (isolated and modified whole proteins or partial peptides from pathogens); toxoids (inactivated pathogenic poisons); polysaccharides or glycoproteins that mimic glycoproteins on the cell surface of pathogens; or chemical conjugation of low-affinity polysaccharides with high-affinity protein carriers to improve B-cell recognition of the polysaccharides. Usually, subunit vaccines elicit lower immunogenicity than WPVs. Strategies to improve subunit vaccine effectiveness include the use of adjuvants, multiple-dose regimens, codon optimisation to improve yield, and amino acid substitution to stabilise the introduced peptide chains [20].
Data. We use recombinant protein subunit vaccines, which have been widely available since the 1980s [94], as another baseline to compare with the novel mRNA and adenoviral vector vaccine platforms, and for their similarity with the even more established WPVs. Following the same logic, we choose two recently FDA-cleared subunit vaccines that both contain recombinant protein subunits and use adjuvants to enhance immune response: Nuvaxovid (COVID-19, Novavax, 2022) [95] and Shingrix (Shingles, GSK, 2017) [21].
Appendix C All figures
Appendix D Further methodological discussions
D.1 Natural experiment
Randomized controlled trials aim to eliminate selection bias, but are mostly only feasible in clinical trials. The estimation of average treatment effect comes with the important assumptions that the treatment of any participant does not have an effect on other participants and that all confounding covariates are accounted for. To mimic randomization in the economy where regions and stakeholders cannot be randomly assigned into groups, economists perform natural experiments where exogenous events “as if” randomly assign subjects into treatment and control groups. Notable examples of natural experiments include difference-in-differences, which estimates the average treatment effect by comparing treatment and control groups, who would otherwise move in parallel without the treatment, in multiple time panels [96, 97]; and regression discontinuity, which exploits abrupt changes, cutoffs, or thresholds to “as if” randomly assign samples into treatment and control groups [98, 99, 100, 101]. One validity requirement of natural experiments rests on the assumption that assignment is sufficiently “as if” random.
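As a minimal worked form of the difference-in-differences comparison described above, let $\bar{Y}_{g,p}$ denote the mean outcome of group $g \in \{T, C\}$ (treatment, control) in period $p \in \{\mathrm{pre}, \mathrm{post}\}$. These symbols are illustrative shorthand introduced here, not notation taken from the cited works.

```latex
\hat{\tau}_{\mathrm{DiD}}
  = \left( \bar{Y}_{T,\mathrm{post}} - \bar{Y}_{T,\mathrm{pre}} \right)
  - \left( \bar{Y}_{C,\mathrm{post}} - \bar{Y}_{C,\mathrm{pre}} \right)
```

Under the parallel-trends assumption, the second bracket stands in for the change the treatment group would have experienced without the treatment, so the difference of the two brackets estimates the average treatment effect.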
D.2 Main Path Analysis
Here we will give a brief summary of “main path analysis”. This was proposed by Hummon and Doreian [83] and, with variations, has been implemented in some popular analysis packages, for example see [39].
Main path analysis starts by looking at all possible paths between a specified set of nodes, a set which varies between implementations of main path analysis. The number of these paths which pass through a given edge is used to assign the “edge weight” of that edge, i.e. a value assigned to that edge. Then the length of a path is defined to be the sum of the edge weights traversed in that path. Finally, the “main path” is defined using a greedy algorithm to find paths of high length as defined using the edge weights. That is, to construct a main path, the current main path is extended by adding one more node to the end of the path such that the length increases by the largest amount. The main path ends when it reaches a sink node, a node with no outgoing edges.
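The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of [83] or [39]: we assume the counted paths are all source-to-sink paths, that the starting node is supplied by the caller, and that ties are broken by adjacency-list order. The toy graph and all function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy DAG: each node maps to the nodes its edges point to.
EDGES = {
    "s1": ["a"], "s2": ["a", "b"], "a": ["c", "d"], "b": ["d"],
    "c": ["t"], "d": ["e", "t"], "e": ["t"], "t": [],
}
PREDECESSORS = {v: [u for u in EDGES if v in EDGES[u]] for v in EDGES}


@lru_cache(maxsize=None)
def paths_from_sources(v):
    """Number of distinct paths from any source node to v."""
    preds = PREDECESSORS[v]
    return 1 if not preds else sum(paths_from_sources(u) for u in preds)


@lru_cache(maxsize=None)
def paths_to_sinks(v):
    """Number of distinct paths from v to any sink node."""
    succs = EDGES[v]
    return 1 if not succs else sum(paths_to_sinks(w) for w in succs)


def edge_weight(u, w):
    """Number of source-to-sink paths that traverse the edge (u, w)."""
    return paths_from_sources(u) * paths_to_sinks(w)


def greedy_main_path(start):
    """Extend the path one node at a time along the heaviest outgoing edge."""
    path = [start]
    while EDGES[path[-1]]:  # stop at a sink node
        u = path[-1]
        # max() keeps the first successor listed when weights tie (an arbitrary choice)
        path.append(max(EDGES[u], key=lambda w: edge_weight(u, w)))
    return path


MAIN_PATH = greedy_main_path("s2")
```

Because all outgoing edges of a node share the same count of incoming paths, the greedy step effectively follows the successor with the most onward paths, and ties at the last branching point are common; a tie-breaking rule therefore has to be chosen explicitly.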
The first main path analysis described the scientific advances that eventually led to the discovery of the DNA structure by Watson and Crick [83]. Main path analysis was later applied to other networks of academic publications [102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112] and, more recently, patents [104, 113, 114, 115, 116, 117, 118, 40, 119, 84]. Case studies of main path analysis range from understanding the emergence of engineered products such as batteries, nanotubes, automobiles, and semiconductors, to academic topics such as bioinformatics, social network analysis, absorptive capacity, the Hirsch index, and peer review. Most of these case studies employ one of the four “out-of-the-box” indices developed by Hummon and Doreian [83] and Batagelj [39], with the objective of reducing the number of nodes in a citation network to a single chain of events to enable qualitative interpretation. However, these studies do not consider the longest path beyond its use as a simplification device.
D.3 Edge weighting
We assign a weight of 1 to all edges because we assume each edge in the citation network, including those on the longest path, represents a minimum viable increment of novelty. A weight of 1 makes increments of innovation equivalent. We believe this assumption is valid because being published in a journal, accepted as a patent, approved to run a clinical trial, or authorised to market a therapeutic represents a minimum normalised threshold of originality from peer review (anecdotally also known as the LPU or Least Publishable Unit, https://en.wikipedia.org/wiki/Least_publishable_unit), i.e. a group would not be able to publish any earlier and would not delay publication, as it would seek to publish an increment as soon as possible. Conversely, any reweighting, such as in main path analysis, impedes interpretability. Future work can use funding amount as weight if data becomes more complete; we do not recommend using time as edge weights because time is already implied in network height and an edge that consumes a long duration of time is not necessarily more novel.
D.4 Citation behaviours
When studying citation networks, it is important to note that citation practices vary. Different types of document may have different goals, and publishers set their own constraints on bibliographies. Citation traditions differ widely between fields, and individual authors add another source of variability. For instance, patent applicants need to strike a balance between minimising citations to demonstrate novelty and citing enough to avoid infringing prior art [81, 120]. Based on Fig. 4, we believe privately-funded publications are likelier to have end-uses in mind and may bias citations towards applied research, whereas publicly-funded publications may be preoccupied with phenomenological questions. In addition, funders enforce grant acknowledgements in publications and patents differently. For example, the US Bayh-Dole Act requires that all recipients of federal research funds report to the funding agency any patent they file and acknowledge on patent documents the existence of federal funding, while many other countries do not have similar requirements. These differences in citation behaviour are likely more pronounced in the multilayer citation network we use, as it assumes publications, patents, clinical trials, and regulatory approvals cite in the same way.
D.5 Patent family
A patent family is a collection of patent applications covering the same or similar technical content. Patent families usually arise from a single invention being filed in multiple countries (“simple patent family”) and when an applicant files new applications for similar existing technical contents (“extended patent family”). Section D.6 below explains the importance of considering patent families in our network.
D.6 Patent prosecution
A patent application often spans several years. Four key dates in chronological order are:
1. Priority date: date used to establish the novelty of an invention
2. Filing date: when a patent application is first filed at a patent office
3. Publication date: when a patent application is published
4. Grant date: when a patent office grants a patent
Patent prosecution is the interaction among patent applicants, patent offices including examiners, and other interested parties. Patent prosecution usually spans between (2) filing date and (4) grant date, but can extend after grant if there is opposition, corrections, or other post-grant proceedings.
Due to patent prosecution, the bibliography of almost every patent is updated with new references during its prosecution time. These can be added by the examiner, the applicant, the assignee, or the inventor. Less frequently, citations are added after grant, usually for more limited reasons, e.g. post-grant opposition, corrections, reissues, etc.
We use the initial patent submission date as our patent publication date. A year or two into the patent process, a recent paper can be added to the application, one that was published after the patent was submitted. As a result a patent may cite forward in time as well as the logically acceptable backwards in time. We could use the patent award date as our patent publication date which would solve the problem with the example just given. However, now we run into problems with documents that cite a patent that is not yet approved yet is a critical part of the innovation process. This illustrates why our using the height of a node in our citation network can be a more consistent record of the logical order in the innovation process compared to calendar time. We also address this issue by considering patent families rather than single patents when possible to capture references added to a patent during patent prosecution.
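The contrast between calendar order and network order can be made concrete with a toy example. The sketch below assumes a hypothetical set of five documents in which a patent filed in 2012 gained a citation to a 2013 paper during prosecution; the records, years and names are invented for illustration only.

```python
from functools import lru_cache

# Hypothetical records: document -> (year, list of documents it cites).
DOCS = {
    "approval": (2020, ["trial"]),
    "trial": (2018, ["patent"]),
    "patent": (2012, ["paper_a", "paper_b"]),  # paper_b added during prosecution
    "paper_a": (2009, []),
    "paper_b": (2013, ["paper_a"]),
}

# Reverse lookup: which documents cite a given document.
CITED_BY = {doc: [] for doc in DOCS}
for citing, (_, cited_docs) in DOCS.items():
    for cited in cited_docs:
        CITED_BY[cited].append(citing)

# Citations that point forward in calendar time.
FORWARD_IN_TIME = [
    (citing, cited)
    for citing, (year, cited_docs) in DOCS.items()
    for cited in cited_docs
    if DOCS[cited][0] > year
]


@lru_cache(maxsize=None)
def height(doc):
    """Longest chain of citations leading from a source node down to doc."""
    return max((height(citing) + 1 for citing in CITED_BY[doc]), default=0)


# Every cited document sits strictly higher than the document citing it.
HEIGHT_ORDER_CONSISTENT = all(
    height(cited) > height(citing)
    for citing, (_, cited_docs) in DOCS.items()
    for cited in cited_docs
)
```

The calendar dates flag one citation as running forward in time, whereas the heights still place every cited document after its citing document in the network order.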
D.7 Critical path hit rate
The criticality of a given node is defined in this paper by equation (1). The critical innovation path, on the other hand, depends on a threshold: nodes with a criticality below a chosen value are considered to reside on the critical innovation path. To provide a fair comparison across funders and across vaccines in Table 3, we define the critical path as the set of nodes whose criticality is below the maximum height in a DAG multiplied by a criticality threshold. To determine the value of this threshold, we conducted a robustness check (Fig. D.1), which indicated a criticality threshold of 0.35 for the Shingrix network, as this is when most funders become present on the critical path. In Table 3, we only include funders who funded more than one node on the critical path and more than ten nodes in the entire network for meaningful comparison.
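The thresholding and funder filter described above can be sketched as follows. All inputs are invented for illustration, and the definition of the hit rate as the share of a funder's nodes that fall on the critical path is an assumption of this sketch rather than a definition given in this section.

```python
# Hypothetical inputs; none of these values come from our data.
DAG_HEIGHT = 10    # maximum height in the DAG
THRESHOLD = 0.35   # criticality threshold
CRITICALITY = {f"n{i}": i % 6 for i in range(12)}
FUNDED_NODES = {
    "F1": [f"n{i}" for i in range(12)],
    "F2": ["n0", "n1", "n4"],
}

# Nodes whose criticality is below the height of the DAG times the threshold.
CUTOFF = DAG_HEIGHT * THRESHOLD
CRITICAL_PATH = {node for node, c in CRITICALITY.items() if c < CUTOFF}


def hit_rates(min_critical=1, min_total=10):
    """Share of each funder's nodes on the critical path (assumed definition).

    Funders are kept only if they funded more than `min_critical` nodes on the
    critical path and more than `min_total` nodes in the whole network.
    """
    rates = {}
    for funder, nodes in FUNDED_NODES.items():
        hits = sum(node in CRITICAL_PATH for node in nodes)
        if hits > min_critical and len(nodes) > min_total:
            rates[funder] = hits / len(nodes)
    return rates


HIT_RATES = hit_rates()
```

In this toy input, `F2` has two nodes on the critical path but funds only three nodes in total, so it is excluded by the second filter.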
D.8 Network density
The density of nodes reflects ambiguity in the networks’ local and global order. Fig. D.2 shows the citation networks are densest at low height and sparsest at high height. The latter is due to dangling nodes, potentially due to incomplete citation data in early years, meaning these regions are sensitive to change. On the other hand, observations drawn from other heights are more stable.
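A simple way to inspect this is to count nodes at each height. The sketch below assumes that density is read as the number of nodes per height level, and uses a hypothetical six-node DAG with unit edge weights; the names are illustrative.

```python
from collections import Counter

# Hypothetical toy DAG: each node maps to the nodes its edges point to.
EDGES = {"s": ["a", "b", "c"], "a": ["d"], "b": ["d"], "c": [], "d": ["e"], "e": []}
PREDECESSORS = {v: [u for u in EDGES if v in EDGES[u]] for v in EDGES}


def height(v):
    """Length of the longest path ending at v; zero for a source node."""
    return max((height(u) + 1 for u in PREDECESSORS[v]), default=0)


# Number of nodes found at each height level.
NODES_PER_HEIGHT = Counter(height(v) for v in EDGES)
```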
References
- [1] S. Kline and N. Rosenberg, “Chain-linked model of innovation,” An Overview of Innovation: The Positive Sum Strategy. National Academy Press, Washington, DC, US, 1986.
- [2] R. R. Nelson and S. G. Winter, An evolutionary theory of economic change. Harvard University Press, 1982.
- [3] F. W. Geels, “Causality and explanation in socio-technical transitions research: Mobilising epistemological insights from the wider social sciences,” Research Policy, vol. 51, no. 6, p. 104537, 2022.
- [4] J. P. Vandenbroucke, A. Broadbent, and N. Pearce, “Causality and causal inference in epidemiology: the need for a pluralistic approach,” International Journal of Epidemiology, vol. 45, no. 6, pp. 1776–1786, 2016.
- [5] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669–688, 1995.
- [6] T. C. Williams, C. C. Bach, N. B. Matthiesen, T. B. Henriksen, and L. Gagliardi, “Directed acyclic graphs: a tool for causal studies in paediatrics,” Pediatric Research, vol. 84, no. 4, pp. 487–493, 2018.
- [7] M. Piccininni, S. Konigorski, J. L. Rohmann, and T. Kurth, “Directed acyclic graphs and causal thinking in clinical risk prediction modeling,” BMC Medical Research Methodology, vol. 20, no. 1, p. 179, 2020.
- [8] D. Acemoglu, U. Akcigit, and W. R. Kerr, “Innovation network,” Proceedings of the National Academy of Sciences, vol. 113, no. 41, pp. 11483–11488, 2016.
