Genomics-based assessment of the geographic origin of spongy moths (Lymantria dispar) intercepted during vessel inspections, using SpongySeq, an amplicon sequencing panel
Sandrine Picq, Arnaud Capron, Julien Prunier, Brian Boyle, Abdelmadjid Djoumad, Don Stewart, Yunke Wu, Richard Hamelin, Michel Cusson

TL;DR
Researchers developed a genomic method to identify the geographic origin of invasive spongy moths intercepted in North American ports, helping prevent their spread.
Contribution
A novel amplicon sequencing panel (SpongySeq) was developed to accurately assign geographic origins of spongy moths using 283 SNPs.
Findings
The SpongySeq panel achieved 82–97% accuracy in assigning spongy moths to 19 geographic groups using DAPC.
Most intercepted egg masses in U.S. ports were traced to Japan, indicating a primary source of invasion risk.
Abstract
Invasive alien species (IAS) are a major threat to native biodiversity, ecosystems services, economic stability and human well-being. The two spongy moths, Lymantria dispar asiatica and L. dispar japonica, native from Asia, are important defoliators of a wide variety of hardwood and coniferous trees, and the risk of their introduction into North America via sea transport is considered high by plant protection regulatory authorities. To prevent such introductions, a cost-effective approach consists in reducing the likelihood that IAS will enter the invasion pathway. This involves identifying the geographic origins of moths intercepted during vessel inspections in North American ports and implementing preventative measures in those foreign ports identified as the sources of moths. In the present work, we designed a genomic-based method for the accurate identification of the geographic…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6- —Genomics Research and Development Initiative (GRDI - Government of Canada)
- —https://doi.org/10.13039/100008762Genome Canada
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLepidoptera: Biology and Taxonomy · Insect Pheromone Research and Control · Environmental DNA in Biodiversity Studies
Introduction
Invasive alien species (IAS) are organisms that establish and spread rapidly in a new environment [1], threatening native biodiversity, ecosystems services or human economy and well-being [2]. These invasive species count among the main direct drivers of global change in nature, along with modifications in land and sea use, direct exploitation of organisms, climate change, and pollution [3]. Among known IAS, insects greatly outnumber other animal taxa in terms of invasion occurrences [4], while their global cost has been estimated at a minimum of 70 billion USD per year for goods and service losses [5]. The forests of North America are not spared from insect IAS, with one establishment recorded every 2.1–2.4 years over the past 150 years and an acceleration of this rate observed in the last decades (i.e., 1.2 establishments per year; [6, 7]). The two spongy moth subspecies, Lymantria dispar asiatica Vnukovskij, present in continental Asia, and L. dispar japonica Motschulsky*,* distributed across the Japanese archipelago, are important defoliators of a wide variety of hardwood and coniferous trees [8], and are considered at high risk of introduction and establishment by plant protection regulatory authorities of Canada and United States (US) of America. This statement stems from the constant increase in trade by maritime transport from East Asia to northern North America [9], and the recurrent interceptions of egg masses and adult moths during foreign vessel inspections in Canadian and US ports [10]. In addition, the presence of the main preferred host trees and the non-limiting climatic conditions in northern North America make the establishment of the Asian subspecies of spongy moths very likely. For instance, these moths have been recorded on 62 occasions in the US between 1991 and 2019 leading to major eradication campaigns [11]. Another feature that makes these moths undesirable is their ability to interbreed with the European subspecies, L. dispar dispar Linnaeus [12], which has been established in North America since 1868–69, and is now considered as one of the most destructive non-native insects in eastern North America [13]. Given that the Asian subspecies of spongy moths are known for their very broad host ranges and their strong female flight capability, unlike the European spongy moth (ESM; [8, 14, 15]), their establishment could lead to the introgression of these traits into North American ESM populations, resulting in more severe outbreaks and an accelerated spread in North America. In this context, avoiding the introduction of the Asian subspecies of spongy moths is among the highest priorities of plant protection regulatory authorities in Canada and the US.
To prevent the introduction and establishment of an IAS, the most cost-effective measure consists in reducing the likelihood that such an organism will enter the invasion pathway [16, 17]. Regarding Asian subspecies of spongy moths, this involves identifying the geographic origins of moths intercepted during vessel inspections in North America and implementing preventive measures (inspection, vessel cleaning, etc.) in those foreign ports identified as the sources of moths. Recently, genome-wide markers were successfully used to identify the source of invasive exotic insects such as the Formosan subterranean termite, Coptotermes formosanus Shiraki [18], the cabbage butterfly, Pieris rapae Linnaeus [19] and the Asian longhorned beetle, Anoplophora glabripennis Motschulsky [20, 21]. These successes rely on the characterisation of the genomic variation in the IAS’s native range and the fine mapping of this variation [22, 23]. Many studies have assessed genetic variation in the spongy moth’s native range using different markers and a varying assortment of focal populations (e.g. [15, 24–36]). However, the work of Picq et al. [37], based on an analysis of 2125 genome-wide neutral single nuclear polymorphisms (SNPs) genotyped in 65 populations, provides the highest resolution obtained to date for this species’ spatial population structure, with the identification of genetic differences at the populational or regional (~ 500 km) scale. This high degree of population structure provides an opportunity for the development of a genomic tool that can reliably identify the geographic origin of an intercepted spongy moth.
The adoption of a new genomic tool by plant protection authorities rests on its accuracy, effectiveness and affordability [38, 39]. For a tool assessing the geographic origin of a specimen, this can be achieved by selecting and using only the most informative markers for assignment of an individual to its population of origin. Indeed, such an approach can reduce genotyping costs while retaining most of the power of the complete set of markers [40]. For the spongy moth, population assignment studies demonstrated high assignment success (≈ 95%) using as few as 60 DNA loci (55 nuclear and 5 mitochondrial; [34]) or 48 SNPs obtained through genotyping-by-sequencing (GBS; [32]). However, these high assignment rates were respectively obtained using very wide assignment regions or only laboratory-reared insects, which limits their operational usefulness. In the present work, we developed an AmpliSeq SNP panel on the Ion GeneStudio S5 System (Thermo Fisher Scientific) to identify the geographic origins of intercepted spongy moths at a fine scale, i.e. regional or country scale. We first reassessed the worldwide population structure of the spongy moth using the dataset from Picq et al. [41], but with the difference that we considered both neutral and selected SNPs in our analysis. Indeed, the use of divergent (i.e. high FST outlier) loci may improve in fine the assignment success of an individual to its geographic origin [42, 43]. Then, we assessed to what extent we could reduce the size of the SNP panel without affecting the high assignment rate (≥ 90–95%) of moths to their original geographic region, when using one of three different methods: a multivariate approach, discriminant analysis of principal components (DAPC; [44]), and two supervised learning methods named Support-Vector-Machine (SVM) and Naïve Bayes, all implemented in the R package, assignPOP 1.2.4 [45]. Then, we designed an AmpliSeq panel, named SpongySeq, to amplify the genomic regions containing the informative SNPs, followed by an evaluation of its performance by assignment analyses using the methods cited above and involving spongy moths, adult and egg masses, whose geographic origins were known. Finally, the geographic source of egg masses intercepted in US ports by regulatory authorities was assessed using the SpongySeq panel.
Materials and methods
Genetic data and SNP filtering
The genotyping-by-sequencing-derived SNP data set originates from a previous study [41] and was obtained from contemporary spongy moth populations sampled to provide optimal coverage of the full geographic range of L. dispar. From this data set, we only kept 6 out of the 9 sampled North American locations, given the weak genetic differentiation observed within this region [37]. This removal reduced data redundancy and sampling imbalance, which could have affected inference of population structure [46, 47]. Altogether, the data considered here consisted of 1343 spongy moths collected at 61 sites in 25 countries (Fig. 1 and Supp. Table 1). Methodological details regarding sampling, DNA extraction and GBS sequencing can be found in Picq et al. [37]. Briefly, DNA was extracted from one antenna and three legs per moth with the DNeasy 96 Blood & Tissue Kit (Qiagen, Carlsbad, CA, USA) following the manufacturer's instructions. Libraries were prepared based on a genotyping-by-sequencing (GBS) protocol using the restriction enzymes PstI and Msp [48] and the sequencing was made on Ion Torrent Proton P1v2 chips. Then, SNPs were discovered and called using Tassel 3.0 GBS [49], with physical alignment to the L. dispar asiatica reference genome (lda.genome.v0.3.masked.fasta.; [50]), using the Burrows-Wheeler Aligner (BWA; [51]). Missing data filtering was initiated with low cut-off values (genotype call rates ≥ 50% and individuals with ≤ 90% of missing data) that were then iteratively and alternately increased to obtain a final data set with a genotype call rate ≥ 80% comprising individuals with ≤ 25% of missing data [52]. Then, sequencing and genotyping errors were discarded by removing SNPs with a minor allele frequency (MAF) < 0.01 across populations [53], with heterozygosity > 0.6 [54] and in Hardy–Weinberg disequilibrium (p-value < 0.001) in at least 6 populations [55]. To reduce genetic linkage among SNPs in our data set (i) we used only the first SNP in each contig (the about 100 bp single-end reads generated by the genotyping-by-sequencing method) and, then (ii) we used the SNP most often genotyped in pairs of SNPs in different contig showing an r^2^ value ≥ 0.8 in at least 6 populations. Indeed, SNPs physically linked can introduce bias in different classical analyses requiring independence of loci such as SNP outlier detection and certain population clustering method [56]. With respect to the SNP filtering step, the process was analogous to that described by Picq et al. [37], with one exception: the heterozygosity threshold above which a SNP was discarded was increased to 0.6, from 0.5. This less stringent threshold appears to be a better compromise between discarding paralogous loci and keeping informative loci [57, 58]. All filtering procedures were carried out using VCFtools [59], and the resulting VCF file was converted to formats suitable for each subsequent analysis, using PGDSpider v2.1.1.5 [60].Fig. 1. Sampling locations identified by their abbreviated names in (A) North America, (B) Asia and (C) Europe, North Africa and the Middle East. The numbers correspond to the following populations: (1) KR02, (2) KR03, (3) KR04, (4) JP02, (5) JP03, (6) JP04, (7) JP07, and (8) JP08. See Suppl. Table 1 for details
Assessment of population structure
The work of Picq et al. [37] evaluated the population structure with neutral and divergent SNPs separately. Here we considered all SNPs together given that those subject to divergent selection may increase the average level of genetic differentiation among populations in cases where neutral SNPs reveal only subtle differences [43]; such an approach is deemed preferable when the aim is to obtain high assignment rates [42]. Thus, population structure was first inferred using a model-based method employing a maximum-likelihood approach implemented in ADMIXTURE v1.3.0 [61] and a k-means algorithm implemented in the find.clusters function of the adegenet R package [44]. We ran ADMIXTURE with a cross-validation procedure for a number of groups K ranging from 2 to 20. For each K value, calculations were repeated 10 times, using different random seeds to assess the stability of the estimate. The most likely numbers of K populations were identified as being the ones exhibiting the lowest cross-validation error compared to other K values. The number of populations was also assessed using a k-means method based on the Hartigan-Wong algorithm implemented in the kmeans function (stats R package). The find.clusters function (adegenet R package) was used to run sequentially the k-means method with an increasing number of K values (from 2 to 20), and the optimal number of groups was identified based on the Bayesian information criterion (BIC; lowest value). The find.clusters function was also run 10 times with different random seeds. To complete the population structure assessment, a principal component analysis (PCA) was carried out using the dudi.pca function of the adegenet R package to visualize the overall variability among individuals and populations. For the k-means and PCA analyses, the missing genotypes were replaced with the mean allele frequencies calculated on the entire dataset using the R function scaleGen (adegenet R package; [44]).
Assignment of moths to source populations and development of SNP panel
To assess the performance of our SNP set to correctly assign a moth to its geographic origin, we used three different approaches. First, we employed discriminant analysis of principal components (DAPC), a procedure that has the property of highlighting genetic differentiation among groups while overlooking within-group variation [44]. We then used a supervised learning method named Support-Vector-Machine (SVM), which aims to find the optimal hyperplane in an N-dimensional space (N, here, is the number of SNPs) that distinctly classifies the individuals. For this method, the Radial Basis Function (RBF) kernel was chosen to transform data to facilitate the hyperplane calculation. Finally, we implemented another supervised learning algorithm called Naïve Bayes (NB), which is based on Bayes’ theorem. This method is named naive due to its underlying “naive” assumption that each input variable (i.e. SNP) used for classification purposes is independent.
To build and assess the performance of SNP panels of different sizes using each of these three methods, we relied on a double cross-validation technique known as Simple Training & Holdout, (STH; [40]). Briefly, in each geographic group, the individuals are randomly divided into training (95% of the individuals) and holdout (or test; 5% of the individuals) sets. Then, all training samples i.e., 1098 moths are used to assess allelic frequencies and rank loci according to their ability to differentiate moths from the 19 identified geographic groups (see results, Fig. 4). Here, the SNPs were ranked according to their fixation index (FST) and used to create nine panels of different sizes (2000, 1500, 1000, 500, 350, 200, 100, 50 and 25 SNPs), each containing top-ranked SNPs. Indeed, the selection of SNPs displaying the greatest genetic differentiation (e.g., FST) maximises the accuracy of assignment to the source population [42, 62, 63]. Next, samples of the holdout set were assigned to a source population using the DAPC, SVM or NB method, where the training samples provided the reference/baseline. Thus, the STH method avoids upward assignment biases occurring when the same individuals are used to rank the SNPs and to test the assignment power of these SNPs [40]. To assess consistency of the results, calculations were made for 20 different training/holdout sets. These different training/holdout sets were built on the K-fold principle in which individuals from each geographic group were divided into K groups (here K = 20). Thus, a given K group (holdout set) was assigned by the predictive model built based on the remaining K-1 groups (training set). This method makes it possible to test every individual in a dataset and to flag unexpected assignment patterns. The DAPC, the SVM and the NB methods were carried out using the assign.kfold function implemented in the R package assignPOP 1.2.4 [45]. For the SVM method, the kernel was specified with the parameter svm.kernel ="radial" in the assign.kfold function. The results for the DAPC method were obtained by selecting the model lda. Indeed, the assign.kfold function uses a PCA to reduce the dimension of the SNP dataset before carrying out a linear discriminant analysis (LDA), that is equivalent to a DAPC.
For each method, we used three different criteria to measure assignment accuracy of a specimen’s original geographic group. Thus, a specimen is correctly assigned to its geographic group if: (1) the posterior probability is ≥ 0.90 for this group, (2) the posterior probability is ≥ 0.80 for this group (the criterion used by Picq et al. [37]), and (3) its “relative probability” is ≥ 2 for this group, as used by Wu et al. [34]. “Relative probability” is defined here as the ratio between the highest and the second highest posterior probability of membership for a given specimen [64]. If a specimen does not meet the assignment criterion, it is considered misassigned i.e. not assigned to its known original geographic group.
SpongySeq Ampliseq panel design
To design AmpliSeq primers targeting the SNPs making up the panel, 10 ng of gDNA from 52 individuals selected to cover as possible the geographic groups defined here (see section population structure) were pooled and sequenced to assess genetic variability in the vicinity of the targeted SNPs (Supp. Table2). First, the DNA pool was sheared to a median size of 300 bp (200–400 bp range) using a M220 Focused-ultrasonicator (Covaris, Woburn, MA, USA) and the Whole Genome Shotgun (WGS) library was prepared using the NEBnext ULTRA II DNA kit (New England Biolabs, Ipswich, MA USA), according to the manufacturer’s instruction. The library was sequenced using a 150 bp paired-end (2 × 150 PE) protocol on an Illumina HiSeq 4000 sequencing system (Illumina Inc., San Diego, CA, USA), at the Génome Québec Innovation Centre, Canada. After adapter trimming, the resulting reads were aligned onto the L. dispar asiatica genome v0.4 (F.O. Hebert, personal communication) using bwa mem [65] with default parameters. The resulting aligned reads were then filtered for duplicates with samtools rmdup and sorted with samtools sort [66]. The resulting bam file was analyzed using GATK HaplotypeCaller, with a ploidy of 104 (–sample-ploidy = Number of samples * Sample Ploidy; [67]). Insertion/deletion polymorphisms were filtered out using vcftools, and so were SNPs with a mean depth ≤ 49 across all individuals [59]. The remaining 19,306,562 SNP coordinates were translated into a bed file.
As the informative SNPs generated by GBS were identified using the v0.3 reference genome [50], we first identified their corresponding coordinates on v0.4. Briefly, the 100 base pairs surrounding a SNP were extracted from v0.3 as a fasta sequence file and then blasted against v0.4 using the blastn command [68]. The v0.4 coordinates were then extracted from the blast results and checked for multiple hits. From the starting list of 350 v0.3-based SNPs (see results, Supp. Table 3), 65 were filtered out due to matching more than three loci in v0.4. Out of the 285 remaining SNPs, 262 were single copy, 12 had two matches and 11 had three matches, yielding a total of 319 (262 * 1 + 12 * 2 + 11 * 3) targets using the v0.4 reference genome (Supp. Table 3). The coordinates for the 319 positions were formatted as a bed file.
The reference genome v0.4, the bed file containing the coordinates of the 19,306,562 population-wide SNPs and the bed file containing the coordinates of the five base pairs surrounding each of the 319 target positions were submitted to the Ion AmpliSeq Designer website (https://www.ampliseq.com/). The portal returned an Ion AmpliSeq design covering 315 of the 319 submitted targets (i.e., 283 SNPs covered; Supp. Table 3). The primers for the proposed SNP panel were then ordered from Thermo Fisher Scientific (design number: IAD210066).
SpongySeq testing
To assess the performance of the SpongySeq panel, two different DNA sample sets were considered (Supp. Table 4). These different DNA sample sets were designed to test different developmental stages (egg masses and adults), to allow comparison of assignment results obtained with GBS vs. the AmpliSeq methods and, finally, to assess the geographic origin of new individuals. We examined egg masses and adult specimens, as these are the stages most frequently intercepted by regulatory authorities. Indeed, while inspectors look primarily for egg masses when searching vessels, pheromone traps are used around ports to monitor male moths that could have escaped, thus increasing the likelihood of intercepting accidental introductions.
First, an “adult DNA set” was comprised of (1) 39 DNA samples from moths already GBS-sequenced in the present study and whose geographic assignments were assessed in analyses described in the above section and (2) 57 moths whose geographic origins (i.e., sampling sites) were known but had not yet been sequenced. Then, an “egg mass DNA set” was made up of DNA samples not yet sequenced, including (1) 11 egg masses whose geographic origins (i.e., sampling sites) were known and (2) 29 egg masses intercepted by USDA on vessels inspected in US ports and already identified as belonging to L. d. asiatica or L. d. japonica using the COI-based TaqMan assay developed by our team ([69]; Supp. Table 4). DNA extraction of the new specimens followed the protocol detailed in Picq et al. [37]; for egg masses, DNA was extracted from 7 eggs/egg mass. Sequencing libraries were prepared using the AgriSeq Library kit (Thermo Fisher Scientific) following the manufacturer’s instructions with a few modifications. Briefly, 3 ng of genomic DNA was used for amplification with 19 PCR cycles before adding IonCode Barcode Adapters. Barcoded PCR products were purified using 1.5 × Ampure beads (Beckman Coulter), quantified with AccuClear Ultra High Sensitivity dsDNA Quantitation kit (Biotium) and pooled in equimolar amounts. 20 ng of the pooled DNA was then amplified with Ion Express primers for 9 cycles using Q5 High-Fidelity DNA Polymerase (New England Biolabs) and purified with 1.3 × Ampure beads. Libraries were quantified by Qubit dsDNA High Sensitivity Kit (Thermo Fisher Scientific) and ran on a Bioanalyzer High Sensitivity DNA chip (Agilent) for quality control. Libraries were diluted to 200 pM and sequenced on Ion 540 chips with an Ion GeneStudio S5 System (Thermo Fisher Scientific). Library preparation and sequencing were carried out at the Genomic Analysis Platform of the Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval (Quebec, Canada). The Ion Torrent reads were mapped onto the reference genome (v0.3) using TMAP (https://github.com/iontorrent/TS/tree/master/Analysis/TMAP); the reference genome v0.3 was chosen to facilitate the merge of the SNPs-derived Ampliseq set of the “specimen to be assigned” with the SNPs-derived GBS of the 1156 spongy moths (see results) needed for the assignment analyses. Then, for the “adult DNA set” and “egg mass DNA set”, polymorphic positions were called using bcftools mpileup, with the bcftools -m (multiallelic-caller) option on [66]. In the two resulting vcf files, the genotypes showing a quality score < 20 (–minGQ 20) and a minimum depth < 10 (–minDP 10) were marked as missing using vcftools [59]. In addition, sites revealing a number of alleles different than 2 (–min-alleles 2 –max-alleles 2), a Minor Allele Count ≤ 3 (–mac 3) and a proportion of missing data ≥ 5% (–max-missing 0.95) were discarded. Finally, samples from both datasets were assigned to one of the 19 geographic groups using the DAPC and the Naïve Bayes methods and the assign.X function implemented in the R package assignPOP 1.2.4 [45]; the SVM method was not considered given its lower performance (see results section). In these assignment analyses, the 1156 moths analysed to build the panel provided the reference/baseline dataset, and the assignment accuracy was measured with the three different criteria described above (see section Assignment of moths to source populations and development of SNP panel).
Results
GBS-derived SNP genotyping and filtering
Starting from the raw sequence files of Picq et al. [41], we extracted reads for 1343 moths collected in 61 sites in 25 countries, a subset of the original dataset, and identified 90,907 SNPs using Tassel 3.0 GBS. Following the filtering procedure, we obtained 2,340 SNPs, which include both neutral and outlier loci (Supp. Table 5), with a genotype call rate ≥ 80% for 1156 moths that had ≤ 25% of missing data (the number of moths kept per population are detailed in Supp. Table 1).
Population structure
The k-mean methods implemented in the find.clusters function defined a likely number of genetically distinct clusters K = 8 (lowest BIC median and lowest BIC values) (Supp. Figure 1 A). With the maximum-likelihood approach implemented in ADMIXTURE v1.3.0, the minimum number of groups varied between K = 10–14 (Supp. Figure 1B), but the analysis of the membership coefficient revealed the presence of ghost clusters, i.e., groups in which no individual achieved a membership coefficient Q > 0.5 [70]. As suggested by Puechmaille [47], we corrected the number K of clusters by removing those considered spurious, which resulted in a Kcorrected = 8–12. Thus, both analyses identified a minimum of eight groups: (1) North-America, (2) Algeria and Iberian peninsula, (3) western and central Europe (France, Belgium, western Germany, Italy, Slovenia, Austria, Slovakia, Kosovo, Bulgaria), (4) Caucasus and Middle East, (5) northern Europe (north-east Germany, Poland, Estonia) and western Russia up to the Novosibirsk region, (6) eastern Central Asia (Kazakhstan, Kyrgyzstan, Altay region in Russia and China), (7) continental East Asia and (8) Japan (Fig. 2). For the 8 < Kcorrected ≤ 12 in the ADMIXTURE v1.3.0, three additional subdivisions were detected (i) separation of the Caucasus populations (Russian Caucasus, Georgia and Turkey) from the Middle East ones (Israel and Syria), (ii) separation of the Estonian population from other northern Europe and western Russia populations and, (iii) separation of northern populations of the Iberian peninsula (PT01 and SP01) from Algerian and other Spanish populations (Fig. 2). In addition, the patterns revealed by the admixture plot suggested supplementary divisions within the western and central Europe group (4 groups: a, France; b, Belgium-western Germany; c, Italy-Slovenia; and d, Central Europe-Balkan, see Fig. 2), the northern Europe and western Russia group (2 groups: e, north-east Germany and Poland; f, western Russia), in East central Asia (2 groups: g, Altay region in Russia and China; h, Kazakhstan and Kyrgyzstan) and in continental East Asia (4 groups: i, eastern China; j, north China; k, Russian Far East; and l, Korea).Fig. 2. Barplot showing probability of membership of each sample to each of K = 8 geographic groups (a different color for each group), as computed by admixture, based on 2340 SNPs derived from 1156 L. dispar individuals. Samples along the x axis are ordered from west to east according to their longitudinal coordinates, with possible exceptions for some of the more southerly samples from the Caucasus/Middle East, which were placed between those of Western and Eastern Europe. The three dotted lines (i, ii, iii) indicate the additional subdivisions identified within the 8 < K ≤ 12 values and the letters (a-l) highlight supplementary divisions suggested by admixture patterns (see Results section for details)
The principal component analysis (PCA) revealed a west-to-east gradient (PC1 6.15%) with populations from the Iberian Peninsula and Algeria seen at one end of the gradient and populations from far east Asia observed at the opposite end (Fig. 3A). The second major feature was a clear separation of North American and Japanese populations along the PC2 axis (2.21%) from European and Asian populations, respectively. The second axis also revealed a cluster of populations from the Caucasus (Georgia and south-western Russia) and the Middle East (Turkey, Israel, and Syria) as distinct from those in Europe. Along the west-to-east gradient, different geographical groups stood out: (i) Algeria and Iberian Peninsula, (ii) Italy-Slovenia, (iii) France, (iv) Central Europe-Balkan, (v) Belgium-western Germany, (vi) northern Europe, (vii) western Russia, and (viii) the Altay region in Russia and China (Fig. 3A). The PC2 also revealed three distinct groups within continental East Asia: a, China; b, Russian Far East; and c, Korea, whereas the PC3 (2.17%) vs. PC4 (1.35%) plot (Fig. 3B) highlighted a difference between the eastern (i) and northern (ii) regions of China, confirming the ADMIXTURE results. Finally, the PC1 vs. PC4 plot shows Kazakhstan and Kyrgyzstan as forming a cluster of populations distinct from those of the Altay region of Russia and China and western Russia populations (Fig S2). To sum up, results of the above analyses led to the identification of 19 distinct geographic groups, as illustrated in Fig. 4.Fig. 3. Principal component analysis (PCA) based on 2340 SNPs derived from 1156 L. dispar individuals. Individuals (dots) are colored according to their country of origin (25 different countries). A PC1 vs PC2 plot. The dotted ellipses indicate populations from (i) Algeria and Iberian Peninsula, (ii) Italy-Slovenia, (iii) France, (iv) Central Europe-Balkan, (v) Belgium-western Germany, (vi) northern Europe, (vii) Western Russia, (viii) the East central Asia and three different regions in continental East Asia: (a) China, (b) northeast China (CN04) and Russia Far East and (c) Korea. B PC3 vs PC4 plot revealed an additional geographic subdivision in China with the eastern (i) being different from the northern (ii) regionsFig. 4Map of the 19 well supported subgroups identified in (i) Eurasia and (ii) North America. AIB: Algeria and south Iberian Peninsula; NIB: North Iberian Peninsula; FRA: France; BED: Belgium and Western Germany; DPL: Eastern Germany and Poland; ITS: Italy and Slovenia; BAL: Balkan; EST: Estonia; RUW: Western Russia; CAU: Caucasus; MDE: Middle East; KGK: Kazakhstan and Kyrgyzstan; RUC: Central Russia; CNE: East China; CNN: North and Northeast China; RUE: Russian Far East and Northeast China; KOR: Korea; JAP: Japan; NAM: Northeastern North America. See Suppl. Table 1 for details of the populations making up each geographic group
Moth assignment analysis using the full SNP data set
Using the DAPC and NB methods with the full set of SNPs (2340), the average assignment success of individuals to their respective geographic groups was high, with a minimum of 96% and 87% of average success, respectively, whatever the assignment threshold used (Fig. 5A, Supp. Table 6). The SVM radial method yielded lower assignment success (65% and 81%) except when the “relative probability” was used as assignment threshold (91%). Interestingly, two geographic groups, eastern China (CNE) and Korea (KOR), stood out with respect to their much lower assignment success for each of the three methods used. The low performance for these two geographic groups was caused by two specific populations, CN08 (CNE group) and KR04 (KOR group), whose individuals were incorrectly assigned to the CNN and RUE groups, respectively. Picq et al. [37] had already noticed that these two populations had genetic profiles that did not correspond to their geographic location. Thus, the calculations were redone after discarding these two populations from the data set. Following this operation, the assignment success for the CNE and KOR groups increased by 9–36%, depending on the method and threshold used (for details see Supp. Table 6). Globally, the removal of the CN08 and KR04 populations increased the average assignment success by 1–2% (from 96.6% to 98.3%) for the DAPC method and by 2–4% (from 91.06% to 93.10%) for the NB method (Fig. 5B). The results for the SVM radial method did not change except where the “relative probability” threshold was used, which increased by 2%. Use of the NB method led to another interesting observation in that it revealed clearly below-average performance for the Belgium-western Germany (BED; < 45%) group compared with the other geographic groups (> 58% whatever the threshold considered) and with the results of the DAPC method (97.62% for the BED group). Overall, the DAPC method is the model that affords the best performance for our full SNP dataset as its assignment success rates are the highest (> 96% at minimum) whatever the assignment threshold used (Supp. Table 6).Fig. 5. Assessment of the accuracy of the 2340 SNPs in identifying the original geographic group of spongy moth samples with (A) or without (B) the populations CN08 and KR04. The boxplots represent the assignment success in percentage as a function of three different methods (DAPC, Naïve Bayes and SVM radial) and three assignment thresholds (see section Moth assignment and SNP panel building for details). The boxplot indicated the distribution of assignment success values obtained. The non-parametric Kruskal–Wallis tests revealed significant differences in assignment success obtained by the three methods for each threshold considered with or without the populations CN08 and KR04. With: 0.9: H(2) = 52.247, p-value < 0.001; 0.8: H(2) = 49.666, p-value < 0.001; Rel. prob.: H(2) = 36.092, p-value < 0.001. Without: 0.9: H(2) = 52.349, p-value < 0.001; 0.8: H(2) = 49.276, p-value < 0.001; Rel. prob.: H(2) = 31.32, p-value < 0.001
Moth assignment analysis with SNP panels of different sizes
Based on the above analysis using the entire dataset, the CN08 and KR04 populations were excluded from subsequent calculations. The average median assignment success remained globally high (≥ 90%) as SNP panel size decreased all the way down to 350 SNPs, for both the DAPC and NB methods, irrespective of the threshold used (Fig. 6, Supp. Table 7). With panel sizes ≤ 100 SNPs, assignment success displayed pronounced drops (from 14 to 64%) for both the DAPC and NB methods (Supp. Table 7) in comparison to the complete dataset i.e., 2340 SNPs. At the 200 SNP panel size, the drop in assignment success is less pronounced (from 5.5% to 14%) and average assignment success remains satisfying with all values above of 80% (from 81 to 89%). Regarding the SVM radial method, the assignment success for the full range of SNP panels was, as expected, lower than with the other two methods, except for when the “relative probability” threshold was used, in which case the values were comparable to those of the NB method (Supp. Table 7). As for the full SNP dataset, the DAPC method is the model that affords the best performance as it showed the highest average assignment success rates for most of the panel sizes, whatever the thresholds used for all the panel sizes, excepted for the 25 SNPs panel size.Fig. 6. Assignment success as a function of decreasing size of SNP panel, as assessed using two different methods: A a multivariate approach, discriminant analysis of principal components (DAPC) and (B) Bayesian approach named Naïve Bayes. The SNPs that made up the panels were selected according to their F_ST_ values. Each boxplot represents the distribution of the assignment success calculated for 20 different training/holdout sets (K-fold approach) using the Simple Training and Holdout method (STH; Anderson, [40])
So, in light of these results, the 200 SNP panel size seemed to be the lowest size we can consider. This is particularly true in view of the fact that the observed drop at this panel size was mainly due to incorrect assignment for geographic groups located in continental East Asia, i.e., eastern Russia (RUE), Korea (KOR), northern China (CNN) and eastern China (CNE; Supp. Table 8). For regulatory purposes, the assignment accuracy for geographic groups present in continental Asia must be maintained as high as possible considering that this region is a potentially important source of L. d. asiatica spongy moths. Given that the minimum acceptable count of SNPs is 200, and acknowledging that an unpredictable number of SNPs could be lost during the Amliseq design (see Results), we have decided to proceed with 350 SNPs. This number strikes a favorable balance between the costs of development, the likelihood of successful assignment, and the risk of significant SNP loss during the development process. Thus, the SpongySeq panel was developed starting with the 350 SNPs displaying the highest FST values.
SpongySeq panel sequencing and testing
For the adult and egg mass DNA sets used to assess the performance of the SpongySeq panel, we obtained on average 396,153 mapped reads per individual, of which 97% were located on targeted loci, resulting in a mean depth of 1073 reads. The average read depth for the 315 amplicons is 1224 reads; however, six of these did not achieve a call rate of 70%, meaning they were not observed in at least 70% of the sequenced specimens, and thus were excluded from further analysis (Supp. Table 3). Moreover, nine amplicons exhibited a read depth of less than 10 in over 35 individuals (more than 25% of specimens) and were similarly discarded (Supp. Table 3). Of the 39 moths previously GBS-sequenced and whose geographic sources had been assessed using the entire SNP dataset (2340 SNPs), 34 meet the criteria in terms of genotype quality and sequencing depth (see Material and Methods section) for assignment analysis (Supp. Table 9). Using the DAPC method and the two more stringent thresholds, the assignment success of the SpongySeq panel was 6.2% lower than with the entire SNP set (88.2% vs. 94.4%), but equivalent to the latter using the “relative probability” threshold (both 97%). Interestingly, using the NB method, the SpongySeq panel performed better than the entire SNP set when the least stringent assignment criterion was used (91.2% vs. 88.9%). For the 0.9 and 0.8 thresholds, the SpongySeq panel showed 12.6% and 6.5% lower assignment success than the entire SNP set, respectively (Supp. Table 9).
For the second portion of the adult DNA set, i.e., the 57 moths whose geographic origins (i.e., sampling sites) were known but had not yet been sequenced, the DAPC method correctly assigned all but three moths to their source populations (94.7% correct assignment; Supp. Table 10). These three moths, all sampled in Europe (Belgium, France and Italy), were assigned to nearby geographic groups (north-east Germany and Poland [DPL], Balkan [BAL], and northern Iberian Peninsula [NIB], respectively). However, the NB method displayed lower performance (73.7%—78.9%), particularly with respect to correctly assigning samples from Belgium and a subset of samples from Korea (Supp. Table 10).
For the egg mass DNA set, among the 11 egg masses whose origin was known, only seven provided enough sequences for analysis. The egg masses collected in Canada were, as expected, assigned to the North American geographic group (NAM) by both assignment methods, using the highest assignment threshold. Interestingly, all egg masses collected in Corsica, a French mediterranean island, were assigned to the Italy-Slovenia group (ITS), while egg masses from Croatia were either assigned to the Balkan (BAL) or the Italy-Slovenia group (ITS). For these latter cases, the two assignment methods provided congruent results, with posterior probabilities meeting the criteria of the lowest assignment threshold (relative probability) for at least one of the two assignment methods (Supp. Table 11).
Regarding the 29 egg masses intercepted in US ports, 28 were successfully sequenced and 27 had matching assignments for the two assignment methods. All 27 egg masses revealed posterior probabilities meeting the criteria of the lowest assignment threshold (relative probability) for at least one of the two assignment methods, while 22 of them exceeded the highest threshold (posterior probabilities > 0.9) with both methods. In one instance, CA19PORT_22, the two methods generated somewhat different assignments: while the DAPC method assigned the egg mass to the RUE group (using the relative probability criterion), assignment using the NB method did not meet any of the thresholds, with the two highest posterior probabilities being similar for the CNN and RUE groups, but slightly higher for the former group (0.51 vs 0.49). Based on these considerations, this egg mass most likely originated in the Russian Far East, i.e., the RUE group. Overall, most egg masses were assigned to Japan (18 egg masses, i.e., 64%), followed by the Russian Far East (8), northern China (1) and Korea (1). These geographic assignments are congruent with the subspecies identifications performed using our TaqMan assay ([69]; Supp. Table 11).
Discussion
The aim of the present study was to develop a AmpliSeq panel using the Ion Torrent Next-Generation sequencing technology (Thermo Fischer Scientific) to accurately identify the geographic origins of spongy moths intercepted in Canadian and U.S. ports of entry during inspection operations. Based on the data presented here, the 283 SNPs (315 targets) panel we designed, named SpongySeq, displays a high level of accuracy, correctly assigning 91–97% of moths to their source population, using either of the most performing assignment methods (DAPC and NB) and the least stringent assignment threshold (“relative probability”, [34]). The striking improvement of the SpongySeq panel relative to similar approaches developed earlier [32, 34] is its ability to accurately assign individuals to their source at a much higher degree of geographic resolution, i.e., at the scale of a region (≈ 500 km) or a country. The prerequisite for such a high geographic resolution is the fine mapping of genomic variation in the spongy moth’s native range, Eurasia. Based on an analysis of neutral SNPs, Picq et al. [37] had previously shown the presence of important genetic differentiation among spongy moth populations (mean pairwise FST of 0.1917) and were able to distinguish 28 geographic groups. However, unlike Picq et al. [37], whose analyses were based exclusively on neutral SNPs, here our characterization of population structure and identification of well-supported geographic groups was based on an analysis of both neutral and divergent SNPs (high FST values). Indeed, previous studies underlined the importance of including SNPs with high genetic differentiation (e.g., FST) when the aim is to maximise accuracy assignment to source populations (e.g. [42, 62, 63]). In addition, while Picq et al. [37] explored their dataset with the objective of detecting the most subtle geographic differentiations, here we aimed at finding a compromise between the number of groups that maximizes geographic resolution and the number of groups that maximizes assignment success. With this goal in mind, we used different classical populations genomics analyses to define 19 groups within the spongy moth’s range, providing sufficient geographic resolution in the regions considered most relevant from a regulatory point of view, namely eastern Asia (5 groups) and the transition zone of female flight capability in central Europe [15]. Assignment analysis using the full set of SNPs (2340) confirmed the appropriateness of these 19 geographic groups as assignment success was high (87–96% on average) using either the DAPC or NB method, irrespective of the assignment threshold used.
Besides its high geographic resolution, the other strong point of the SpongySeq panel is its high assignment success despite the limited number of SNPs that make it up (283), which limits cost of running the assay. To determine the optimal number of SNPs, we first carried out assignment analyses using SNP panels of nine different sizes, ranging from 2000 to 25 SNPs (Fig. 6). These analyses identified 350 as the size offering the best compromise between high assignment success and SNP panel size. Below this number, i.e., ≤ 200 SNPs, assignment success decreased significantly within eastern continental Asia, particularly for the Chinese geographic groups (Supp. Table 8), an outcome that is not desirable given that this region is potentially a significant source of L. d. asiatica spongy moths. However, the drop in assignment success in this region is not surprising as these populations have been observed to be less genetically differentiated [37]. Following transposition of the 350 SNPs onto the updated version of the reference genome (see section SpongySeq Ampliseq panel design) and after completing design of the AmpliSeq primers, the final number of SNPs in our panel dropped to 283 SNPs (or 315 targets). Assignment analysis of 34 spongy moths indicated that the SpongySeq panel was equivalent or only slightly less accurate (minus 6.2%) than the full GBS-derived SNP data set (2340 SNPs) with the DAPC method, irrespective of the threshold use. This result confirms the panel’s minimal loss of assignment accuracy as compared to the full set of SNPs.
We also assessed the performance of our SpongySeq panel on different sample sets. The first one comprised moths whose geographic origins (i.e., sampling sites) were known but had not been sequenced previously. Indeed, for the 57 samples successfully genotyped, the panel performed well as only three samples were not assigned to the correct geographic group. These misassignments were all in Western Europe and the corresponding moths were assigned to geographic groups contiguous to the expected ones. For example, a moth sampled in Belgium (BE02 Site) was assigned to the DPL group (Germany-Poland), which may have resulted from human-aided movement inasmuch as the BE02 site is located on a university campus where the intense movements of goods and people make the introduction of moths from other geographic areas very likely. The assignment results obtained for egg masses sampled in Croatia and Corsica were also interesting. For the two egg masses sampled in Croatia, their SpongySeq geographic assignments (ITS and BAL group; Fig. 4) were in accord with previous work that showed a genetic split between coastal and continental populations in Croatia caused by the Dinaric Alps [29], indicating that populations from coastal Croatia are genetically closer to those from Italy and those from continental Croatia to Balkan populations [30]. For egg masses sampled in Corsica, a French mediterranean island, both were assigned to the Italy-Slovenia group, which is consistent with the phenomenon of recurrent connections between Corsica and central Italy during the Quaternary ice ages [71]. The most important dataset in terms of regulatory significance is the one that comprised 28 intercepted egg masses in different U.S. ports, in 2019 and 2020. These egg masses were primarily assigned to Japan (18 egg masses, 64%) and the Russian Far East (8 egg masses, 28%). Interestingly, Wu et al. [34] who assessed the geographic origins of 26 egg masses intercepted at U.S. ports of entry between 2014 and 2015, found the opposite pattern, i.e., continental Asia the main sources of the egg masses (21 egg masses i.e. 80%), followed by Japan (4 egg masses or 19%). This difference most likely mirrors the high population densities observed in the Russian Far East in 2014 [11] and in Japan in the years 2019–20 (M. Inoue, pers. comm.). These fluctuating patterns of geographic origin, which reverse every few years, show that invasion pathways can change rapidly depending on variations in population densities in native areas or on trade routes.
The development of our diagnostic SNP panel was made possible thanks to the advent of amplicon sequencing approaches that use highly multiplexed polymerase chain reaction (PCR) to genotype many individuals and hundreds of loci in a single high-throughput sequencing run [72]. In recent years, different in-house or commercial amplicon sequencing approaches (e.g. GT-seq [73], TruSeq Custom Amplicon, Ion AmpliSeq custom panels) were successfully used (i) to conduct genetic stock identification in a fish, the walleye (Sander vitreus; [74]), (ii) to detect mutations causing a predisposition to breast and ovarian cancers in humans [75], and (iii) to study variation in candidate photoperiod sensitivity genes among maize races [76], for instance. For our project, we needed an amplicon sequencing approach that met the requirements of spongy moth biosurveillance in terms of input DNA amount, number of targeted sequences and the required level of sample multiplexing per sequencing run. More precisely, in a regulatory context, the intercepted spongy moth specimens must be preserved, and DNA extraction must be done using small body parts (legs, antennae), resulting in small input DNA quantities. In addition, choosing an amplicon sequencing approach that made it possible to include hundreds of potential target sequences gave us greater latitude with respect to panel size, given that the minimum number of SNPs necessary for an accurate geographic assignment was initially unknown. Finally, an approach enabling high specimen multiplexing per high-throughput sequencing run was needed to achieve low-cost analysis of hundreds of spongy moths every year. Considering all these requirements, we chose the Ion AmpliSeq™ (≥ 1 ng input DNA amount, up to 6,144 primer pairs per pool and 384 individuals per sequencing run) and currently, we obtained the amplicon panel genotypes for 40 CA$/sample (price when processing 192 samples) in four days from purified and normalized DNAs. To facilitate the use of our panel, future work aimed at adapting it to other popular sequencing technologies such as sequencing by synthesis (SBS, Illumina) and nanopore sequencing (Oxford Nanopore). With respect to sequencing by synthesis, the TruSeq Custom Amplicon marketed by Illumina was recently discontinued and replaced with the AmpliSeq for Illumina, a change that could greatly facilitate the transfer of technology. Likewise, a recent study in forensic genetics showed successful transfer of an established Ion AmpliSeq panel to nanopore sequencing technology [77]. The nanopore sequencing is particularly attractive for regulatory applications as this technology is implemented in a small, highly portable pocket-size device, the Minion (Oxford Nanopore, Oxford, Great Britain), opening the way to assigning spongy moths directly in the field (ports, border, etc.), thereby accelerating regulatory procedures following the detection of an introduction.
Finally, a recurring obstacle to the use of genomic tools is the need for bioinformatic skills to analyze the data obtained by these tools. So, to ensure that our SpongySeq panel can be used by as many people as possible, we are currently developing a user-friendly interface. In this way, the user will simply need to drag and drop the raw file of the sequences to a loading window of the interface, click the analysis button and obtain within seconds the results in the form of a geographical position displayed upon a map with geographical coordinates.
Conclusions
Our work presents the development and use of an amplicon sequencing panel identifying accurately the geographic origins of spongy moths, an important invasive forest pest, intercepted during ship inspections in North American ports. More precisely, our amplicon sequencing panel, named SpongySeq, which targets 283 SNPs, correctly assigned up to 97% of the spongy moths tested, sampled either as egg masses or adults, to one of the 19 geographic groups defined in the present work. Among the three different models used for assignment analyses, i.e., a multivariate approach, discriminant analysis of principal components (DAPC), and two supervised learning methods named Support-Vector-Machine and Naïve Bayes, the DAPC method afforded the best performance with the present dataset by yielding the highest assignment success rates irrespective of the thresholds considered.
The striking improvement of the SpongySeq panel relative to previous approaches is its ability to accurately assign individuals to their source at a much higher degree of geographic resolution, i.e., at the scale of a region (≈ 500 km) or a country, providing sufficient geographic resolution in regions considered most relevant from a regulatory point of view, such as eastern Asia. Utilizing this panel, an analysis of the origins of 28 egg masses of Asian spongy moths intercepted across different US ports in 2019–2020 demonstrated that the predominant source was Japan, characterized by high population densities during those years. Eventually, to facilitate wider adoption of the SpongySeq panel, we are currently working on its adaptation to other prevalent sequencing technologies and the development of a user-friendly interface.
Supplementary Information
Supplementary Material 1. Supplementary material 2.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1International Union for Conservation of Nature. Guidelines for the prevention of biodiversity loss caused by alien invasive species. 2000. Available from: https://portals.iucn.org/library/node/12673. Cited 2025 May 12.
- 2Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. Summary for policymakers of the global assessment report on biodiversity and ecosystem services. Zenodo; 2019. Available from: https://zenodo.org/doi/10.5281/zenodo.3553579. Cited 2025 May 12.
- 3Pogue M, Schaefer PW. A review of selected species of Lymantria Hübner (1819) (Lepidoptera: Noctuidae: Lymantriinae) from subtropical and temperate regions of Asia, including the descriptions of three new species, some potentially invasive to North America. USA: Forest Health Technology Enterprise Team (FHTET); 2007. (Technology transfer series).
- 4Canadian Food Inspection Agency, Animal and Plant Health Inspection Service. Flighted spongy moth complex (FSMC) (Lymantria dispar asiatica, L. d. japonica, L. albescens, L. postalba, and L. umbrosa). 2025. Available from: https://www.aphis.usda.gov/sites/default/files/joint-fsmc-bulletin-usda-cfia.pdf.
- 5Picq S, Wu Y, Martemyanov VV, Pouliot E, Pfister SE, Hamelin R, et al. Range-wide population genomics of the spongy moth, Lymantria dispar (Erebidae): implications for biosurveillance, subspecies classification and phylogeography of a destructive moth. Dryad; 2023. Available from: https://datadryad.org/dataset/doi:10.5061/dryad.wwpzgmsp 5. Cited 2025 May 13.10.1111/eva.13522 PMC 1003385236969137 · doi ↗ · pubmed ↗
- 6Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ar Xiv; 2013. Available from: http://arxiv.org/abs/1303.3997. Csited 2025 May 12.
- 7Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SA Mtools and BC Ftools. Giga Science. 2021;10(2):giab 008.10.1093/gigascience/giab 008PMC 793181933590861 · doi ↗ · pubmed ↗
