Linked avian influenza epidemiological and genomic data in EMPRES-i for epidemic intelligence (2012–2021)

Nejat Arınık; Roberto Interdonato; Mathieu Roche; Maguelonne Teisseire

PMC · DOI:10.1016/j.dib.2025.111410·February 25, 2025

Linked avian influenza epidemiological and genomic data in EMPRES-i for epidemic intelligence (2012–2021)

Nejat Arınık, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire

PDF

Open Access

TL;DR

This paper introduces a dataset linking avian influenza genomic data with surveillance records to improve global monitoring and understanding of the disease's spread.

Contribution

The novel contribution is a dataset that integrates genomic sequences with EMPRES-i surveillance data for avian influenza, enabling better epidemic intelligence.

Findings

01

The dataset combines AI genomic data from BV-BRC with EMPRES-i records from 2012 to 2021.

02

Data linkage strategies and preprocessing methods are introduced to ensure high-quality integration.

03

The dataset is evaluated through a diffusion network inference task to demonstrate its utility.

Abstract

Due to its highly contagious nature, Avian Influenza (AI) is considered an animal health emergency affecting commercial sector and wild bird populations. Several genome sequencing databases have been created to help researchers understand how AI viruses evolve, spread, and cause disease. However, for a global epidemic monitoring approach, they need to be combined to public health surveillance systems, the well-one being EMPRES-i from the World Organisation for Animal Health (WOAH) and the Food and Agriculture Organization of the United Nations (FAO). This paper presents a new AI dataset, in which EMPRES-i is enriched thanks to the genome sequence data of Avian Influenza cases affecting bird species from 2012 to 2021, publicly provided by the Bacterial and Viral Bioinformatics Resource Center (BV-BRC). This dataset is obtained by automatically linking sequence information in BV-BRC to…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases2

Avian Influenza AI

Figures4

Click any figure to enlarge with its caption.

Workflow regarding our data processing and normalization tasks applied to the events of BV-BRC.Fig. 1Table 2Example of raw texts associated with an isolate record in $[eqn]$ regarding its spatial, temporal, disease and host information. These information can be sometimes problematic, as illustrated in this example. For the sake of completeness, we list here all possible issues with the raw texts. 1) Spatial: A region information can be provided with an ISO or ADM1 code. For instance, in this example the code *ENG* correspond to England. Furthermore, when several spatial attributes are provided

Evaluation of how close the computed isolate similarity scores in $[eqn]$ with the four strategies proposed in Section 4.3, compared to $[eqn]$ .Fig. 3

Our three inferred networks $[eqn]$ , $[eqn]$ and $[eqn]$ based on the AI H5N8 events in Europe for the period October-Decembre 2016. In (a), (b) and (c), a network is plotted twice for the sake of clarity. The first one corresponds to the obtained network and the second one represents the highlighted version of the first one. These highlights are based on the network pairs $[eqn]$ and $[eqn]$ , and show the evolution of the edges from $[eqn]$ to $[eqn]$ and from $[eqn]$ to $[eqn]$ , respectively. Finally, country codes in ISO 3166-1 alpha-2 standard are shown in all maps.Fig. 4

Keywords

Avian influenzaEpidemiologyData linkageTransmission dynamicsNetwork inference

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsZoonotic diseases and public health · Influenza Virus Research Studies · Data-Driven Disease Surveillance

Full text

Specifications TableSubjectComputer Science: Information SystemSpecific subject areaLinked Avian Influenza Epidemiological and Genomic DataType of dataTabular data (*.csv). Raw and Standardized.Data collectionThe surveillance data were retrieved from EMPRES-i and the sequence information from BV-BRC. Disease: Avian Influenza, Host: Birds, Study period: 2012–2021.Data source locationThe data are hosted on the INRAE Dataverse in the context of the MOOD (MOnitoring Outbreaks for Disease surveillance in a data science context) project1.Data accessibilityRepository name: Data INRAE (Dataverse) Data identification number: doi: 10.57745/JNA7N9 Direct URL to data: https://doi.org/10.57745/JNA7N9Related research articleNone.

Value of the Data

1

•This dataset contributes to the available resources in the field of Avian Influenza surveillance and epidemic intelligence.
•It completes the genetic information of the spatio-temporal AI events.
•It is useful for epidemiologists and computer scientists for studying AI transmission dynamics.
•It can be used for evaluation or training purposes for classification and network inference tasks.

Background

2

The emergence and spread of Avian Influenza (AI) has serious consequences for animal health and a substantial socio-economic impact for agriculture. For instance, the 2021–2022 season have witnessed the largest observed highly pathogenic avian influenza (HPAI) cases in Europe so far, with a total of 2467 outbreaks in poultry, 3573 HPAI events in wild birds, and 48 million birds culled in the affected establishments2. Due to this highly contagious nature, it is critical to monitor the ongoing AI cases. To this aim, epidemic intelligence has been used to remedy this animal health emergency.

For a global epidemic monitoring approach, several national and international surveillance systems are used, the well-known one being the EMPRES-i database from the World Organisation for Animal Health (WOAH) and the Food and Agriculture Organization of the United Nations (FAO) [1]. This database regularly collects structured and verified official animal health threats, hereafter referred to as epidemiological events (or events for short), through routine national surveillance systems and public health authorities. As a result, it is a well-populated official database for Avian Influenza and has been often used as reference gold standard in the literature [2], [3], [4].

Currently, EMPRES-i does not provide any linkage between its epidemiological events and the corresponding genome sequence information. However, combining epidemiological information and geomapping in the analysis of AI can contribute to a better understanding and description of AI epidemiology. In the literature, [5] has already proposed in 2013 to enhance the EMPRES-i database for H5N1 and H7N9 serotypes, but their genetic module is not operational anymore. For this reason, we propose in this paper a new AI dataset, in which EMPRES-i is enriched with the genome sequence data of AI cases, publicly provided by the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) [6]. This new dataset concerns the AI events in EMPRES-i, affecting bird species from 2012 to 2021. It is worth highlighting that the AI host types (e.g. mammals) other than birds are not in scope of this work.

Data Description

3

Our goal in this work is to enrich the AI cases in the EMPRES-i database $[eqn]$ with genetic information provided by the BV-BRC database $[eqn]$ . As explained later in Section 3.2, we employ two linkage strategies to associate the genetic information in $[eqn]$ to $[eqn]$ : 1-to-1 and 1-to-many linking. In the first one, a genome sequence can be associated to only one EMPRES-i event, whereas this unicity constraint is omitted in the second one in order that a genome sequence can be linked to multiple EMPRES-i events. The first (resp. second) strategy is more strict (resp. relaxed) and always produces less (resp. same or more) linked data compared to the other strategy. Ideally, the 1-to-1 strategy must be the only choice for such a task. However, due to possibly erroneous and imperfect information in our databases, it might be beneficial to use in practice the 1-to-many strategy to have more linked cases between $[eqn]$ and $[eqn]$ , depending on the application at hand. Finally, the application of these two strategies results in two datasets, that we call $[eqn]$ and $[eqn]$ , respectively. We detail their distributions per year and disease serotype in Table 1. In total, $[eqn]$ and $[eqn]$ contain 4797 and 13,300 events, respectively.Table 1. Statistics on the two datasets $[eqn]$ and $[eqn]$ . The columns represent all possible AI strains found in $[eqn]$ , grouped by the $[eqn]$ subtype for simplicity, and the rows correspond to yearly periods from 2012 to 2021. The last column (resp. row) summarizes the statistics by row (resp. column). Finally, each entry in the table has the form of $[eqn]$ , in which $[eqn]$ , $[eqn]$ and $[eqn]$ represent the number of events in $[eqn]$ , $[eqn]$ and $[eqn]$ , respectively.Table 1. YearH3H5H6H7H9H10Total20120/0/0231/544/7320/0/06/42/52185/187/1920/0/0422/773/976****20130/0/0164/474/6880/0/0273/347/430172/172/1801/1/1610/994/1,299****20140/0/0263/964/1,2660/0/0600/690/70524/25/262/2/3889/1,681/2,000****20150/0/0992/2,662/3,0620/0/087/303/31216/16/170/0/01,095/2,981/3,391****20160/0/0453/1,548/2,0490/0/0102/192/2349/9/210/0/0564/1,749/2,304****20170/0/0561/2,360/3,4490/0/1266/1,035/1,0404/4/60/0/0831/3,399/4,496****20180/0/0162/403/7800/0/09/18/314/4/40/0/0175/425/815****20197/7/729/86/2390/0/01/8/621/1/10/0/031/95/309****20200/0/089/414/1,5540/0/01/3/440/0/00/0/090/417/1,598****20210/0/089/785/3,7230/0/01/1/70/0/00/0/090/786/3,729Total7/7/73,033/10,240/17,5420/0/11,346/2,639/2,917415/418/4473/3/44,797/13,300/20,918

It is worth mentioning that both $[eqn]$ and $[eqn]$ are not the fusion of the AI cases from $[eqn]$ and $[eqn]$ , rather they are directly related to the EMPRES-i database. For this reason, they contain only the serotypes and avian host names available in $[eqn]$ (see the online supplementary material for all the available information). Although there are some discrepancies in the disease/host focus of both databases and they are therefore complementary (see the online supplementary material for more details and comparative results), merging them to obtain a single large AI database is not in scope of this work.

Experimental Design, Materials and Methods

4

This section details how we obtain our final datasets $[eqn]$ and $[eqn]$ by using data normalization (Section 3.1) and data linkage strategies (Section 3.2).

Data processing and normalization

4.1

In this section, we describe the data processing and normalization tasks applied to $[eqn]$ , as illustrated in Fig. 1. These tasks are required for data linkage explained later in Section 3.2. Here, the data processing operations aim to clean and reformat the raw entries, and to complete the missing information, if possible. These are essential operations, because raw entries can be sometimes problematic, as illustrated with an example in Table 2.Fig. 1. Workflow regarding our data processing and normalization tasks applied to the events of BV-BRC.Fig. 1. Table 2Example of raw texts associated with an isolate record in $[eqn]$ regarding its spatial, temporal, disease and host information. These information can be sometimes problematic, as illustrated in this example. For the sake of completeness, we list here all possible issues with the raw texts. 1) Spatial: A region information can be provided with an ISO or ADM1 code. For instance, in this example the code ENG correspond to England. Furthermore, when several spatial attributes are provided, they might not be ordered in a hierarchical manner. For instance, in this example Skelmersdale is a town in England. Therefore, it must be placed after England in order to respect the hierarchal order. 2) Temporal: This information can be partially complete. For instance, in this example the day information is missing. 3) Disease: This information can be partially complete. For instance, in this example the N subtype information is missing (e.g. the N1 part in H5N1). 4) Host: This information can be very detailed (e.g. with gender and age) and can have multiple host keywords. For instance, in this example the keywords mallard, duck and bird are found in the same description.Table 2. LocationDateDiseaseHostIsolateUnited Kingdom, Skelmersdale, ENG03–2021H5mallard duck; bird; gender Female; age Adult{PB1: 11320.124852, PB2: 11320.124884}

For the normalization task, we use the same normalization operations used in [4], applied for $[eqn]$ . This task consists in normalizing the attributes of each event by transforming a raw text into one of well-defined taxonomy classes (i.e. hierarchical representation), assuming that the events are defined as in Section 3.1.1. Concretely, these processing and normalizing operations concern the spatial (Section 3.1.2), temporal (Section 3.1.3), disease (Section 3.1.4) and host (Section 3.1.5) information of the events in $[eqn]$ .

Event definition

4.1.1

We define an event throughout this work as the detection of the AI virus for a specific host at a specific date and in a specific location. Moreover, we also consider its genetic information, when available. For instance, this information is available for BV-BRC and our final datasets $[eqn]$ and $[eqn]$ , but not for EMPRES-i. All these different information constitute the attributes of an event. Note that in an event, a location is expressed as the names of polygons (e.g. country or city names), but its spatial coordinates can be easily retrieved through a geocoding tool thanks to our normalization step (see Section 3.1.2). Moreover, the genetic information in $[eqn]$ is organized as virus isolates. An isolate is the name for a virus that we have isolated from an infected host. In an AI isolate, there are in total eight segments: PB1 (polymerase basic 1), PB2 (polymerase basic 2), PA (polymerase acidic), HA (hemagglutinin), NP (nucleoprotein), NS (nonstructural protein), NA (neuraminidase) and M (matrix protein) [7]. However, in some isolates in $[eqn]$ , it is possible to have only some of these segments, which gives the partial view of an isolate. For instance, in the example of Table 2, only the segments PB1 and PB2, out of eight, are present. Note that we even make use of these partial isolates in data linkage strategies explained in Section 3.2.

For comparison purposes, the attributes of an event are usually normalized. This normalization step allows representing an event attribute in a hierarchical manner, thanks to well-defined hierarchical taxonomy classes. For instance, we obtain the normalized event illustrated in Table 3, after the event normalization task is applied to the raw entries in Table 2. Note that each event attribute can have a different hierarchical level. For the sake of compactness and simplicity, we show in this work only the information available at the finest-grained level for each event attribute. For instance, the compact view of the normalized event in Table 3 is illustrated in Table 4.Table 3. Normalized event representation in a hierarchical manner for the raw texts of an isolate record illustrated in Table 2 after the event normalization task is applied (see Section 3.1 for more details). For the sake of simplicity, we represent the isolate information as an additional event attribute, although it is disease-related information.Table 3. Hierarchy levelLocationDateDiseaseHostIsolate1Europe2021avian fluaves (bird){PB1: 11320.124852, PB2: 11320.124884}2United Kingdom03–2021H7neognathae3Englandweek 13H7N9galloanserae4Lancashire31-03-2021anseriformes5West Lancashireanatidae6Skelmersdaleanatinae7anas (duck)8anas platyrhynchos (mallard duck)Table 4. Compact view of the normalized event in Table 3. Each column corresponds to an event attribute. We show in these columns only the information available at the finest-grained level.Table 4. LocationDateDiseaseHostIsolateSkelmersdale31-03-2021H7N9 serotypemallard duck{PB1: 11320.124852, PB2: 11320.124884}

Spatial information

4.1.2

Each AI case in $[eqn]$ has the spatial information. But, this information can be at different spatial scale from one case to another (country, city, etc). Next, we describe our spatial entity disambiguation and normalization steps.

First, we need to perform spatial entity disambiguation. Indeed, due to the hierarchical nature of this information, some values can be ambiguous, because there is not any rule regarding the attribute order. For instance, the city information can randomly be preceded or succeeded by its region name (see Table 2 for an example). This makes the normalization task difficult. Therefore, we use three geocoding tools (ArcGIS3, Nominatim4, GeoNames5) to solve this attribute order issue. The goal is not to normalize spatial entities, rather identifying which part of the text corresponds to spatial entity attributes. For instance, after solving the attribute order issue in Table 2, we find out that Skelmersdale is a town, which is contained in England.

Then, we perform the normalization of spatial entities. This task consists in assigning geographic coordinates to spatial entities. In this work, we perform this task with the gazetteer GeoNames, as done in [4]. For a given query of spatial entity, GeoNames outputs a ranked list of most appropriate geographic coordinates associated with the input text. We simply take the first result, associated with the desired country name. For instance, if GeoNames proposes two results for Skelmersdale with two different country information (e.g. United Kingdom and Sweden), then we keep the result with United Kingdom, which is the desired country name according to Table 2.

Temporal information

4.1.3

Each AI case in $[eqn]$ has also the collected date information, which is in the form of YYYY-MM-DD. However, this information in several cases is partially complete, in that the day and/or month information is missing. We handle these incomplete dates with two strategies. If the temporal information only misses the day attribute, we simply consider it the first day of its month. Otherwise, when both the day and month attributes are missing, we duplicate the event 12 times, one for each month. The last operation aims to ease the data linkage process between $[eqn]$ and $[eqn]$ . Finally, we normalize the temporal expressions according to the TIMEX3 annotation standard.6

Disease information

4.1.4

The serotype information of some AI cases in $[eqn]$ are partially completed, for instance H5 or N1 instead of H5N1. To make the data at hand more available in data linkage, we estimate their exact serotype information, thanks to our auxiliary database $[eqn]$ of isolate similarity scores, obtained from all pairs of events in $[eqn]$ . This $[eqn]$ auxiliary database is more detailed later in Section 3.2. Concretely, for a given event with partial serotype information, we first take from $[eqn]$ the isolate similarity scores between the isolate in question and most likely other isolates, then the isolate with highest similarity score determine its exact serotype. For instance, if the serotype is H5, then we select in $[eqn]$ all isolates with the H5 subtype (e.g. H5N1, H5N2, etc.). We normalize these disease values with custom taxonomy classes in order to group the serotypes within the same H subtype (e.g. H5N1 and H5N2 are grouped for H5).

Host information

4.1.5

The host information can be very detailed (e.g. with gender and age). For this reason, we select only avian names through the NCBI Taxonomy database. In the end, an AI case in $[eqn]$ can have multiple host keywords extracted (see Table 2 for an example). Then, for each AI case, we normalize these host keywords against the NCBI Taxonomy database [8], using a manually composed table of species name synonyms. Then, we keep the host name, which is at finest-grained level. For instance, if the keywords mallard duck and duck are both present, we keep only mallard duck.

Data linkage

4.2

In this section, we take in input the preprocessed and normalized events (hereafter, simply events) from $[eqn]$ and $[eqn]$ , as explained in Section 3.1. Our goal is to identify common (i.e. ”putatively” linked) events between $[eqn]$ and $[eqn]$ in an automatic manner, which is not a trivial task. We illustrate in Fig. 2 the workflow regarding this data linkage task. In the following, we first introduce how we compute the similarity of two events (Section 3.2.1), then pass to the data linkage strategies (Section 3.2.2).Fig. 2. Workflow for event linking.Fig. 2

Event similarity

4.2.1

For event linking, we need to assess the similarity of two events in the presence of hierarchical data. This hierarchical data requires us to rely on an ontology-based semantic similarity measure. Due to this specificity, we use the similarity measure proposed in [4], which is similar to the state of the art measures, but tailored for epidemiological events. For the sake of clarity, we briefly explain it here.

In the similarity assessment, the main idea is that two events are considered similar, if 1) all their event attributes are identical or hierarchically linked, and 2) their event dates are close enough. Otherwise, they are penalized with a large negative value in the calculation in order that the underlying events are not linked. We slightly change the calculation proposed in [4] for the spatial attribute, tough. In its initial version, two events of different countries cannot be linked. We change it in order that two events of neighboring or geographically close countries to be linked. Concretely, we first calculate the similarity for each event attribute, and then we sum up the obtained values in order to get the final score. In the end, we obtain a similarity matrix, in which each entry correspond to the similarity score for a given event pair from $[eqn]$ and $[eqn]$ (see Fig. 2 for an illustration).

Data linkage strategies

4.2.2

In this section, we present two linkage strategies: 1) 1-to-1 (Section 3.2.2.1) and 2) 1-to-many (Section 3.2.2.2) linking. In the first strategy, an event in $[eqn]$ can be associated with only one event in $[eqn]$ , whereas in the second strategy the same event in $[eqn]$ can be mapped to multiple events in $[eqn]$ . In both strategies, we rely on the similarity matrix $[eqn]$ of $[eqn]$ and $[eqn]$ . The term $[eqn]$ represents the similarity score between events $[eqn]$ and $[eqn]$ , and it is calculated as described in Section 3.2.1.

1-to-1 Event Linking As a first strategy, an event in $[eqn]$ can be associated with only one event in $[eqn]$ . We propose to model this task as an assignment problem based on the matrix $[eqn]$ , as already done in the literature (e.g. [9]). It can be solved through the well-known Hungarian algorithm [10]. In the end, we obtain a set of ”putatively” associated events between $[eqn]$ and $[eqn]$ . Finally, in the solution of the assignment problem, some events might be assigned to other events with negative or weak positive similarity scores. Therefore, we perform a post-processing by removing the assignment results, whose similarity scores are lower than some threshold value. In the end, we obtain our final dataset $[eqn]$ .

1-to-many Event Linking As a second strategy, an event in $[eqn]$ can be associated with multiple events in $[eqn]$ in order to cover as many events as possible in $[eqn]$ . This means that each event in $[eqn]$ is associated to the event in $[eqn]$ with highest similarity score. Concretely, for each event in $[eqn]$ , we take the column-wise maximum in $[eqn]$ . Finally, as in Section 3.2.2.1, some events might be assigned to other events with negative or weak positive similarity scores. Therefore, we only keep assignment results, whose similarity scores are lower than some threshold value. In the end, we obtain our final dataset $[eqn]$ .

Limitation: Missing Isolate Information

5

In our final datasets $[eqn]$ and $[eqn]$ , some of their events has missing isolate information after the data linkage process, due to the different data sizes in $[eqn]$ and $[eqn]$ and possibly erroneous and imperfect information in these sources. In this section, the goal is to show how we manage to handle the missing isolate information in $[eqn]$ and $[eqn]$ , which can be beneficial for an application at hand (e.g. as our practical case in Section 6). In the following, we first introduce how we define isolate similarity (Section 4.1), then present the construction of an auxiliary database of isolate similarity scores (Section 4.2). Finally, we present our four strategies for the missing isolate information (Section 4.3).

Isolate similarity

5.1

Recall that in an AI isolate, there are in total eight segments: PB1 (polymerase basic 1), PB2 (polymerase basic 2), PA (polymerase acidic), HA (hemagglutinin), NP (nucleoprotein), NS (nonstructural protein), NA (neuraminidase) and M (matrix protein) [7]. Each segment is associated with a genome sequence. In the isolate similarity assessment, we compute the similarity value for each same segment pair (e.g. PB1 vs. PB1, PB2 vs. PB2), then take its average to obtain the final similarity score.

To compute the similarity of two segments (i.e. genome sequence), we rely on pairwise sequence alignment. This is the process of aligning two sequences to each other by optimizing the similarity score between them based on a predefined substitution matrix [11]. In this work, we use a default substitution matrix proposed in the Bio.Align Python package [12]. Finally, the obtained raw similarity score from the substitution matrix is normalized by the maximum similarity score obtained when each sequence is compared to itself.

Construction of an auxiliary database: isolate similarity scores

5.2

We take advantage of the large size of genome sequence information provided by BV-BRC to constitute an auxiliary database $[eqn]$ of isolate similarity scores. These scores are obtained with the similarity measure explained in Section 4.1 for all pairs of temporally close events in $[eqn]$ with complete serotype information (e.g. H5N8 vs. H5N8, H5N8 vs. H5N1). The temporal distance is fixed in such a way that two events of the same year or subsequent years are only kept (e.g. 2017 vs. 2017, 2017 vs. 2018, 2018 vs. 2017). Note that this auxiliary database is used for two purposes: 1) completion of partial serotype information (Section 3.1.4) and 2) handling missing isolate information in the datasets $[eqn]$ and $[eqn]$ (Section 4.3).

As an example, we show in Table 5 an excerpt from these scores for only the events with H5N8 serotype, occurring in South Africa and Namibia. In this table, we call source and target events to distinguish two events in the similarity calculation. Moreover, to show how many similarity scores are computed by serotype pair in $[eqn]$ , we show some statistics in Table 6. The very large numbers in this table highlight the importance of $[eqn]$ and its capability of precise estimations in any task.Table 5. Excerpt from $[eqn]$ . Only some events with H5N8-H5N8 serotype pair, occurring in South Africa and Namibia, are shown. In this table, we call source and target events to distinguish two events in the similarity calculation.Table 5. Source country vs. Target countrySource sequence nameSource yearTarget sequence nameTarget yearSimilarity scoreA/African penguin/ South Africa/18010422/201820180.99South Africa vs South AfricaA/African oystercatcher/ South Africa/18030214/20182018A/Guinea fowl/ South Africa/17080243/201720170.92A/African penguin/ South Africa/476266/201820180.92A/African penguin/ Namibia/ 146S/201920190.71South Africa vs NamibiaA/African oystercatcher/ South Africa/18030214/20182018A/African penguin/ Namibia/ 218-1/201920180.61A/African penguin/ Namibia/ 288-1/201920180.80Table 6Sizes of some serotype pairs in $[eqn]$ .Table 6. Serotype pairSizeH5N1 vs H5N11,308,268H5N8 vs H5N8605,345H7N9 vs H7N9545,614H9N2 vs H9N234,845,712H5N1 vs H5N8567,071H5N8 vs H7N9356,169

Handling missing isolate information

5.3

If two events in $[eqn]$ have their associated isolate information, we can simply compute the isolate similarity between them. However, despite of two linkage strategies proposed in Section 3.2.2, it is possible not to assign an isolate to an event in $[eqn]$ . This can be due to automatic normalization issues or incompleteness of $[eqn]$ . To overcome this issue, we propose four strategies for handling the absence of isolation information in $[eqn]$ and $[eqn]$ . The goal here is to take advantage of $[eqn]$ in order to compute average similarity scores with respect to some selected event attributes.

Given event pairs, the first and second strategies are used when only one event has missing isolate information, and not the other one. In this case, the one with isolate information is referred to as source event, and the other as target event. The first strategy is used, when the sequence information of the source event, i.e. source sequence, and the country information of the target event, i.e. target country, are known in $[eqn]$ , given a serotype pair. We illustrate this with an example in Table 7. For instance, the first row corresponds to the average similarity score between A/African oystercatcher/South Africa/18030214/20187 and South Africa. This similarity score is obtained by taking the average of all similarity scores obtained from event pairs, both occurring in South Africa with known isolate information. Otherwise, if the first strategy cannot be used, we use the second strategy, in that only source sequence is used to compute an average similarity score without taking the target country into account, as illustrated in Table 8.Table 7. Illustration of the first strategy with H5N8-H5N8 serotype pair, which is used when only one event has missing isolate information.Table 7. Source sequence nameTarget countrySimilarity scoreA/African oystercatcher/South Africa/18030214/2018South Africa0.95Zimbabwe0.92Belgium0.53A/African penguin/Namibia/146S/2019Namibia0.99Nigeria0.79Pakistan0.64Table 8Illustration of the second strategy with H5N8-H5N8 serotype pair, which is used when only one event has missing isolate information.Table 8. Source sequence nameSimilarity scoreA/African oystercatcher/South Africa/18030214/20180.85A/African penguin/Namibia/146S/20190.72

The third and fourth strategies are used when none of the two events has an isolate information. The third strategy relies on the country information of both events. If a pair of county names, for a given serotype, is known in $[eqn]$ , then we compute the average similarity score by taking all similarity scores obtained for event pairs with known isolate information, occurring in both countries. This is illustrated in Table 9. Otherwise, we use the fourth strategy. In this case, we compute the average similarity score by taking all similarity scores obtained from each pair of events with complete isolate information, without taking the country information into account. This is illustrated in Table 10.Table 9. Illustration of the third strategy with H5N8-H5N8 serotype pair, which is used when none of the two events has an isolate information.Table 9. Country pairSimilarity scoreSouth Africa vs South Africa0.97South Africa vs Zimbabwe0.96South Africa vs Belgium0.56Namibia vs Namibia0.99Namibia vs Nigeria0.79Namibia vs Pakistan0.68Table 10Illustration of the fourth strategy with H5N8-H5N8 serotype pair, which is used when none of the two events has an isolate information.Table 10. Serotype pairSimilarity scoreH5N5 vs H5N50.99H5N8 vs H5N80.85H5N5 vs H5N80.70

Quantitative Evaluation

6

In this section, we evaluate the proposed strategies to deal with the completion of partial serotype information (Section 3.1.4), data linkage between two event databases (Section 3.2.2) and handling missing isolate information in $[eqn]$ and $[eqn]$ (Section 4.3). For these assessments, we create a subset $[eqn]$ of events from our data, which contains in total 500 randomly sampled events with complete isolate information. Since all the events in $[eqn]$ have complete isolate information, we use the dataset $[eqn]$ in our assessments as the ground-truth. Next, we detail our three quantitative evaluation tests by using $[eqn]$ and show their corresponding results.

First, we evaluate how successful our proposed completion strategy in Section 3.1.4 is for dealing with partial serotype information. For this assessment, we create another dataset $[eqn]$ of events by duplicating $[eqn]$ and making the disease serotype information of all its 500 events partially complete (e.g. H5 or N1 instead of H5N1). Then, we perform the completion of partial serotype information based on the auxiliary database $[eqn]$ , as explained in Section 3.1.4, in order to compare the results with $[eqn]$ . As a result, our evaluation test finds out that the proposed strategy correctly estimate the complete serotype information in 444 events (0.89 in proportion).

Second, we are also interested in the evaluation of the event linking process between two event databases, as explained in Section 3.2.2. For this assessment, we first create multiple datasets of events by duplicating $[eqn]$ and perturbing the events to the extent of the perturbation parameter $[eqn]$ , which is in the range of [0,1]. Concretely, the perturbation process first randomly selects with the probability of $[eqn]$ the attributes of an event for which the modification is done, and it then makes the selected attributes coarser (i.e less precise) based on the corresponding taxonomy trees. When the value of $[eqn]$ is close to 0 (resp. 1), this means that the events of $[eqn]$ are modified to small (resp. large) extent and they are very (resp. not very) close to their initial counterpart. In our evaluation test, we use four $[eqn]$ values, which are 0.25, 0.50, 0.75 and 1.00, and this results in four datasets of events, which we call $[eqn]$ , $[eqn]$ , $[eqn]$ and $[eqn]$ , respectively.

Then, we apply the 1-to-1 and 1-to-many event linking strategies between $[eqn]$ and all four perturbed datasets of events. Ideally, the linking process is supposed to link the same events, which can be verified based on their event identifiers. If the process finds (resp. does not find) the same events, we say that they are correctly (resp. falsely) linked. It is also possible that the linking process fails to link some event pairs in two event datasets (i.e. unlinked cases). We show in Table 11 the proportion of the correctly and falsely linked event pairs, as well as that of unlinked cases, for four perturbed event datasets. We see from the table that the performance of event linking gets worse when the perturbation degree increases, as expected. However, the proportion of correctly linked cases is still large enough (i.e. the scores of 0.75 and 0.81), even when $[eqn]$ .Table 11. Evaluation of the event linking process between $[eqn]$ and four perturbed datasets of events $[eqn]$ , $[eqn]$ , $[eqn]$ and $[eqn]$ .Table 11. StrategyDescriptionEvaluation with $[eqn]$ Evaluation with $[eqn]$ Evaluation with $[eqn]$ Evaluation with $[eqn]$ 1-to-1Proportion of correctly linked cases0.950.920.850.75Proportion of falsely linked cases0.030.050.080.14Proportion of unlinked cases0.020.030.070.111-to-manyProportion of correctly linked cases0.980.940.880.81Proportion of falsely linked cases0.010.020.050.07Proportion of unlinked cases0.010.030.070.12

Finally, we also assess how correct the estimation of the applied four strategies is for handling missing isolate information in $[eqn]$ and $[eqn]$ (Section 4.3). Recall that these strategies are applied when at least one event has missing isolate information for the isolate similarity calculation of two events. We perform our evaluation test in two parts. In the first part, we explore to what extent the proposed four strategies are in practice used. To do so, we create another dataset $[eqn]$ of events by duplicating $[eqn]$ and removing the isolate information of its 250 events. We show in Table 12 the proportion of use of these four strategies in the pairs of events in $[eqn]$ , when when at least one event has missing isolate information. We see from this table that the strategies 1 (with the score of 0.54) and 3 (with the score of 0.30) are prevalently used in practice.Table 12. Proportions of use of four strategies proposed in Section 4.3 for the pairs of events in $[eqn]$ , when when at least one event has missing isolate information (see the column Isolate information).Table 12. StrategyIsolate informationProportion of use1 (source sequence vs. target country)only one of the events0.542 (source sequence only)only one of the events0.053 (source country vs. target country)none of events0.304 (default, serotype pair)none of events0.11

In the second part, we also rely on $[eqn]$ and $[eqn]$ , and assess how close the computed isolate similarity scores in $[eqn]$ are after the estimation with the four strategies, compared to $[eqn]$ . For this assessment, we first separately compute the isolate similarity among pairs of events in $[eqn]$ and $[eqn]$ . Then, we calculate the absolute difference values for the same event pairs in $[eqn]$ and $[eqn]$ to see how close these results are. Note that the same event pairs in $[eqn]$ and $[eqn]$ can be verified based on their event identifiers. We show in Fig. 3a the absolute difference values between $[eqn]$ and $[eqn]$ . We see that approximately 80 % of the estimated scores are in the error range of [0,0.1] (i.e. yellow and green bars). Furthermore, for the sake of completeness, we also show in Fig. 3b the distribution of the calculated similarity scores in $[eqn]$ and $[eqn]$ , before the calculation of absolute difference values. We observe that their overall distributions are sufficiently similar, with some small skewness differences.Fig. 3. Evaluation of how close the computed isolate similarity scores in $[eqn]$ with the four strategies proposed in Section 4.3, compared to $[eqn]$ .Fig. 3

Practical Case

7

To illustrate the usefulness of our datasets of AI events, we present a diffusion network inference task at meta-population level, publicly available online8. A network inference problem consists in estimating the underlying network structure, i.e. complete information on edge connectivity, node existence and the exact edge weights, from the event data at hand. In our context, the nodes and edges in the network to be inferred correspond to the spatial zones at ADM1 level (i.e. first level of subnational boundaries) and the disease transmissions among them, respectively. Generally speaking, we only know when an AI event occurs, but not exactly from where it is propagated, i.e. the underlying transmission dynamics among the zones. Hence, this network inference task aims to unveil the hidden AI transmission information in the presence of the temporal, spatial and genetic information of AI events.

To perform this task we adapt the method proposed in [13] to our dataset $[eqn]$ 9 Briefly, [13] adopts a space-time diffusion model and a survival analysis framework for estimating the network structure. We simply extend their work by including the genetic information of AI events. Similar to [13], we use Rayleigh distribution with the parameter $[eqn]$ to model the temporal distances among the events, and Exponential distribution with the parameters $[eqn]$ and $[eqn]$ for the spatial and genetic distances, respectively. The values of $[eqn]$ are what we estimate from the network inference problem and the values of the parameters $[eqn]$ and $[eqn]$ are fixed to 0.01.

In this practical case, for illustrative purposes we select only a subset of our dataset, corresponding to the AI H5N8 events occurred between October and December 2016 in Europe. This period corresponds to the beginning of the H5N8 wave, which is the largest in the EU in terms of number of poultry outbreaks, geographical extent and number of dead wild birds. [14]. There are in total 606 events, in which 75 events do not have any isolate information despite of our data linkage strategy. We rely on the four strategies of handling mission isolate information, as explained in Section 4, in order to compute the isolate similarity values among these 75 events and the rest. To show the interest of including additional information, we sequentially infer three networks $[eqn]$ , $[eqn]$ and $[eqn]$ for time-only, space-time and space-time-genetic information, respectively. We filter out the edges, whose weight is lower than 0.05 to keep only the pertinent ones. We evaluate the obtained results in a qualitative manner based on the phylogenetic analysis conducted by [15], which estimates the transmission flows among AI H5 cases in Eurasia for the period 2016-17. In [15], the authors mainly find out that the virus is carried by wild birds during autumn migration 2016 to wintering locations in Europe through two main flows: 1) Russia $[eqn]$ countries around Baltic Sea $[eqn]$ Netherlands $[eqn]$ France and 2) Russia $[eqn]$ Ukraine $[eqn]$ Hungary.

We first visualise the inferred networks $[eqn]$ , $[eqn]$ and $[eqn]$ in Fig. 4. In this figure, a network is plotted twice for the sake of clarity. The first one corresponds to the obtained network and the second one represents the highlighted version of the first one. These highlights are based on the network pairs $[eqn]$ and $[eqn]$ , and show the evolution of the edges from $[eqn]$ to $[eqn]$ and from $[eqn]$ to $[eqn]$ , respectively. An edge is colored in black, if it exists in both networks. Otherwise, its color is red in $[eqn]$ (resp. $[eqn]$ ), if it appears only in $[eqn]$ (resp. $[eqn]$ ) and not in $[eqn]$ (resp. $[eqn]$ ). These red edges $[eqn]$ (resp. $[eqn]$ ) indicate that they are filtered out from $[eqn]$ (resp. $[eqn]$ ). Similarly, an edge is colored in green in $[eqn]$ (resp. $[eqn]$ ), if it appears only in $[eqn]$ (resp. $[eqn]$ ) and not in $[eqn]$ (resp. $[eqn]$ ). These green edges in $[eqn]$ (resp. $[eqn]$ ) indicate that they are inferred thanks to the inclusion of the spatial and (resp. spatio-genetic) information.Fig. 4. Our three inferred networks $[eqn]$ , $[eqn]$ and $[eqn]$ based on the AI H5N8 events in Europe for the period October-Decembre 2016. In (a), (b) and (c), a network is plotted twice for the sake of clarity. The first one corresponds to the obtained network and the second one represents the highlighted version of the first one. These highlights are based on the network pairs $[eqn]$ and $[eqn]$ , and show the evolution of the edges from $[eqn]$ to $[eqn]$ and from $[eqn]$ to $[eqn]$ , respectively. Finally, country codes in ISO 3166-1 alpha-2 standard are shown in all maps.Fig. 4

We can summarize the results in three points. First, we see from Fig. 4 that the network inference task with additional spatial and genetic information make the networks sparser (i.e. $[eqn]$ compared to $[eqn]$ and $[eqn]$ compared to $[eqn]$ ), as indicated with the existence of multiple red edges in $[eqn]$ for $[eqn]$ and $[eqn]$ for $[eqn]$ . This can be seen as a filtering step towards reaching most likely transmission pathways. For instance, including the genetic information allows keeping only the transmission from Netherlands to the North of France by filtering out the transmission from Germany. Second, adding the genetic information does not add any new (i.e. green) edges in $[eqn]$ . Moreover, we only observe some slight differences between $[eqn]$ and $[eqn]$ . This indicates that the spatial and genetic distances between events are mostly correlated. Finally, our findings are mainly in line with the results in [15]. Overall, the benefit of including the genetic information is visually shown.

Ethics Statement

No conflict of interest exists in this submission. The authors declare that the work described in this paper is original and not under consideration for publication elsewhere, in whole or in part. Its publication is approved by all the authors listed.

CRediT authorship contribution statement

Nejat Arınık: Methodology, Software, Data curation, Writing – review & editing. Roberto Interdonato: Writing – review & editing. Mathieu Roche: Writing – review & editing. Maguelonne Teisseire: Writing – review & editing.

Bibliography15

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1FAO, EMPRES global animal disease information system (EMPRES-i), 2021, Accessed on 23 October 2024, http://empres-i.fao.org/empres-i, licence CC-BY-4.0.
2Arsevska E.Valentin S.Rabatel J.de Goër de HervéJ.Falala S.Lancelot R.Roche M.Web monitoring of emerging animal infectious diseases integrated in the french animal health epidemic intelligence system P Lo S ONE 138201812510.1371/journal.pone.0199960 PMC 607574230074992 · doi ↗ · pubmed ↗
3Valentin S.Boudoua B.Sewalk K.Arinik N.Roche M.Lancelot R.Arsevska E.Dissemination of information in event-based surveillance, a case study of avian influenza P Lo S ONE 1892023 e 028534110.1371/journal.pone.0285341 PMC 1047989637669265 · doi ↗ · pubmed ↗
4Arınık N.Interdonato R.Roche M.Teisseire M.An evaluation framework for comparing epidemic intelligence systems IEEE Access 112023318803190110.1109/ACCESS.2023.3262462 · doi ↗
5Claes F.Kuznetsov D.Liechti R.Von Dobschuetz S.Truong B.D.Gleizes A.Conversa D.Colonna A.Demaio E.Ramazzotto S.Larfaoui F.Pinto J.Le Mercier P.Xenarios I.Dauphin G.The EMPRES-i genetic module: a novel tool linking epidemiological outbreak information and genetic characteristics of influenza viruses Database 20140201410.1093/database/bau 008PMC 394552624608033 · doi ↗ · pubmed ↗
6Olson R.D.Assaf R.Brettin T.Conrad N.Cucinell C.Davis J.J.Dempsey D.M.Dickerman A.Dietrich E.M.Kenyon R.W.Kuscuoglu M.Lefkowitz E.J.Lu J.Machi D.Macken C.Mao C.Niewiadomska A.Nguyen M.Olsen G.J.Overbeek J.C.Parrello B.Parrello V.Porter J.S.Pusch G.D.Shukla M.Singh I.Stewart L.Tan G.Thomas C.Van Oeffelen M.Vonstein V.Wallace Z.S.Warren A.S.Wattam A.R.Xia F.Yoo H.Zhang Y.Zmasek C.M.Scheuermann R.H.Stevens R.L.Introducing the bacterial and viral bioinformatics resource center (BV-BRC): a resource combining PATRIC, IRD and Vi PR Nucleic Acid · doi ↗ · pubmed ↗
7Charostad J.Rezaei Zadeh Rukerd M.Mahmoudvand S.Bashash D.Hashemi S.M.A.Nakhaie M.Zandi K.A comprehensive review of highly pathogenic avian influenza (hpai) h 5n 1: an imminent threat at doorstep Travel Med. Infect. Dis.55202310263810.1016/j.tmaid.2023.10263837652253 · doi ↗ · pubmed ↗
8Federhen S.The NCBI taxonomy database Nucleic Acids Res.40D 12011 D 136D 14310.1093/nar/gkr 117822139910 PMC 3245000 · doi ↗ · pubmed ↗