A draft genome assembly of the agricultural pest Leucoptera coffeella and analysis of its dsRNA processing machinery is a key step toward RNAi-based biopesticides in Lepidoptera
Jay K Goldberg, Leonardo A Vidal, Erick S L Queiroz, Eliza F M B Nascimento, Marcos J A Viana, Wellington R Clarindo, Andrea Q Maranhao, Natália F Martins, Érika V S Albuquerque

TL;DR
This paper presents a draft genome of the Coffee Leaf Miner and identifies genes involved in RNAi, which could lead to new biopesticides for controlling this pest.
Contribution
The study provides a draft genome assembly and identifies RNAi-related genes in Leucoptera coffeella, offering targets for biopesticide development.
Findings
A draft genome of Leucoptera coffeella was assembled with high gene completeness (91.7%).
Seven core genes of the small interfering RNA machinery were identified and confirmed via qPCR.
The findings highlight potential targets for RNAi-based biopesticides in Lepidoptera.
Abstract
The Coffee Leaf Miner (Lepidoptera: Lyonetiidae: Leucoptera coffeella) is a specialist herbivore and major global pest of coffee plants. Current pest control strategies primarily rely on chemical pesticides which in turn negatively impact both human health and ecological stability. Additionally, the emergence of insecticide-resistant populations underscores the urgent need for more specific and efficient pest management strategies. The development of novel techniques for controlling this insect pest requires rigorous interrogation of its physiology and interactions with host plants at a molecular/genetic level. To enable future research in this vein, we sequenced and assembled a draft L. coffeella genome using PacBio highly accurate long-reads (HiFi). Our assembly is comprised of 1,615 contigs showing fragmentation, yet the majority of gene content is represented (BUSCO complete =…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig. 1
Fig. 2
Fig. 3| Feature | Value |
|---|---|
| Assembly size | 370,558,664 |
| Number of contigs | 1615 |
| Largest contig | 6,298,745 |
| Contig N50 | 544,430 |
| Repeat content (%) | 46.44 |
| GC content (%) | 36.07 |
| Assembly BUSCO complete (odb10_lepidoptera) | 91.6 |
| Annotation BUSCO complete (odb10_lepidoptera) | 91.2 |
| Number of genes (helixer) | 17,467 |
| Functionally annotated genes (eggnog-mapper) | 13,725 |
- —CONCAFE consortium
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsInsect Resistance and Genetics · Insect-Plant Interactions and Control · Neurobiology and Insect Physiology Research
Introduction
Lepidopteran insects, which include butterflies and moths, are estimated to represent more than 70% of agricultural pests. Current control practices rely on the application of chemical pesticides to the detriment of off-target species and, in some cases, human health (Pathak et al. 2022), causing global biodiversity losses (Raven and Wagner 2021) and triggering insecticide resistance in important crop pests (Barathi et al. 2024; Willow and Smagghe 2025).
The Coffee Leaf Miner (CLM; Lepidoptera: Lyonetiidae: Leucoptera coffeella; Fig. 1) is a major pest of one of the most traded crops in the world. This monophagous Lepidopteran is currently a cosmopolitan pest present in all coffee producing countries, causing up to 87% losses on both Coffea arabica and Coffea canephora plantations (Picanço Filho et al. 2024). Leaf damage is caused by larval feeding behavior, which consists of burrowing mines into the mesophyll (Fig. 1), leading to necrosis and consequently reducing photosynthesis, accelerating senescence, and causing defoliation that weaken plant growth and reduces yield (Dantas et al. 2021). CLM reproduction is favored by high temperatures and dry climatic conditions, leading to worsening infestations that are further exacerbated by anthropogenic climate change (Leite et al. 2020).
Symptoms and developmental stages of Leucoptera coffeella in coffee plants. The pictures show: (a) Coffee plant (C. arabica) heavily infested by the CLM, showing chlorosis and defoliation; (b) Leaf damage caused by larval feeding, with characteristic necrotic mines along the leaf blade; (c) Larva L2 isolated from leaf tissue after removal from mine; (d) Pupa within its silk protective structure adhered to the abaxial leaf surface; (e) Adult moth of L. coffeella, showing characteristic moth morphology and microlepidoptera size. Scale bars = 1 mm.
RNA interference (RNAi) and post-transcriptional gene silencing continue to offer promising solutions for developing alternative, species-specific biopesticides (Ortolá and Daròs 2024). RNAi-based technologies, which are centered around the application of double-stranded RNA (dsRNA) constructs to silence genes in target species (Niu et al. 2024), include low-risk biopesticides that reduce pest loads on crops while avoiding unwanted environmental consequences (Willow and Smagghe 2025). RNAi constructs can be directly engineered into host plants themselves or delivered by non-transgenic spray-induced methods that may avoid the regulatory complications surrounding genetic modification (Chen et al. 2025). In order to design effective RNAi-based pesticides, we must enhance our fundamental understanding of these mechanisms in insect species of interest; a process that begins with the generation of reference genome resources (Schoville et al. 2018; Sparks et al. 2020).
To enable molecular biological studies of CLM and the development of targeted RNAi-based pest control solutions, we sequenced and assembled a draft genome for L. coffeella. Due to the small physical size of this insect, we used a pooled-sample of multiple individuals that we sequenced with PacBio highly accurate long-reads (HiFi). To improve knowledge about the RNAi machinery genes in Lepidopteran pests, we identified in our assembly multiple genes that are involved with exogenous dsRNA processing pathway and validated their active expression across development stages. Additionally, BUSCO categories allow the search of species-specific biopesticide targets. Our analysis, and the underlying genomic and transcriptomic datasets, will serve as a firm foundation for future research—fundamental and applied—on CLM and its interactions with host plants.
Materials and methods
Sample collection, preparation, and sequencing
Genomic DNA (gDNA) was obtained from a pool of individuals at the pupal stage (Fig. 1), collected from C. arabica leaves originating from Embrapa Cerrados, Planaltina, D.F., Brazil. A modified protocol of the EZNA insect DNA kit (Omega BioTek, Cat. No.: D0926-02) was used for the extractions, as described (Nascimento et al. 2022). DNA quantity and quality were analyzed using NanoDrop, Qubit, agarose gel electrophoresis, and Femto Pulse. High molecular weight gDNA was stored at 4 °C until sequencing began. Genomic sequencing was carried out by Precision Genomics (Gangseo-gu, Seoul, South Korea). The samples were prepared using a PacBio HiFi Express Prep kit and sequenced using the Sequel II platform.
Total RNA was extracted from CLM at different developmental stages: three larval instars (L2, L3, and L4), pupae, male and female adults. Extracted RNA was cleaned with a ReliaPrep RNA Tissue Miniprep System kit (Promega, Cat. No.: PRZ6111), applying the modified protocol described (Nascimento et al. 2022). After extraction, the concentration and quality of the RNA were evaluated using NanoDrop, Qubit, and Bioanalyzer, and the sample was stored immediately at −80 °C until sequencing using Illumina technology Novaseq 6000.
Genome size estimate via flow cytometry
Adult insects (male and female) were collected from two populations in Brazil (Viçosa-MG and Barreiras-BA) and dissected in commercial saline solution. Nuclear extraction and isolation were processed in a BD Accuri™ C6 Flow Cytometer (Accuri, Belgium) and the flow cytometry histograms were analyzed considering G0/G1 fluorescent peaks with coefficient of variation below 5% for nuclear 2C value measurement. Mean nuclear 2C values were converted to Mbp, considering that 1C pg is equivalent to 978 Mbp (Praça-Fontes et al. 2011).
Genome assembly and annotation
Circular consensus sequencing (CCS) output (i.e. HiFi reads; Nreads = 1,712,831; N50 = 14,954) were found to have a highly inflated number of low-coverage contigs (Supplementary Fig. 1) which could be due to genetic heterogeneity or the presence of contamination. To identify contaminant reads, we first produced a metagenomic assembly of our data using metaMDBG (Benoit et al. 2024). We then used the blobtoolkit pipeline (Challis et al. 2020), which calls on minimap2 (Li 2018) and blastn (Camacho et al. 2009), to identify contaminant contigs and filter our PacBio HiFi reads to remove contigs identified as anything other than Lepidoptera (N = 63,334). Kmer analysis of both raw and filtered reads was conducted using jellyfish (Marçais and Kingsford 2011) and GenomeScope2.0 (Ranallo-Benavidez et al. 2020). Filtered reads (N = 1,649,497) were then assembled using hifiasm 2.4.0 (Cheng et al. 2021) with the self-scaffolding function (–dual-scaf) turned on. We assessed the completeness of this assembly using BUSCO v5.4.7 within the lepidoptera_odb10 dataset and determined that our genome was heavily duplicated due to the presence of alternative haplotigs resulting from our pooled sample of multiple insects. To remove duplicated regions, we ran the purge_dups algorithm v1.2.6 (Guan et al. 2020) five times with the -e option removed to include both duplications at the ends and interiors of contigs. Genome statistics were obtained from Bandage v0.8.1 (Wick et al. 2015) after each round of purge_dups. Our purged assembly was then polished with Inspector v1.3.1 (Chen et al. 2021), an error-correction module developed for long-read genome assemblies. Structural annotation was performed using Helixer v0.3.4 using the pre-trained invertebrate model (Stiehler et al. 2021; Holst et al. 2025). The completeness of the annotation was determined using BUSCO v.5.4.7 (Seppey et al. 2019) in protein mode and by aligning RNA-seq reads of multiple life stages (larvae, pupae, and adults; 1× sample per stage) to the genome using STAR v2.7 (Dobin et al. 2013) on the default settings. Functional annotation of the gene models was done using the eggnog-mapper (v2) web portal (eggnog-mapper.embl.de; Cantalapiedra et al. 2021). Repetitive element content of our genome was assessed using RepeatModeler and RepeatMasker (Tarailo-Graovac and Chen 2009).
Structural modeling and qPCR
Gene models that eggnog-mapper identified as RNAi pathway components in L. coffeella data were submitted to structural predictions using InterproScan and AlphaFold3 to confirm both gene annotations and structural-functional conservation (Blum et al. 2021; Abramson et al. 2024). To obtain the multiple sequence alignments (MSAs), the selected RNAi genome represented were used as queries in BLASTP searches against the reference protein sequences database at the National Center for Biotechnology Information (NCBI) and organisms Manduca sexta, Plutella xylostella, and Drosophila melanogaster. Homologous sequences from related insect species alignments obtained by COBALT. For the structural alignment, we submitted candidate sequences and their homologues to AlphaFold2.0 server to provide a basis for comparison and validation, experimentally solved structures of homologous RNAi proteins from other organisms (e.g. from Drosophila melanogaster or other Lepidoptera) were downloaded from the Protein Data Bank (PDB). Both the predicted models and the experimental reference structures were imported into Chimera. This structural superposition allowed for the visualization and analysis of conserved structural cores and functional domains critical for the RNAi machinery, assessing the quality of the predictions and revealing evolutionarily maintained architectural features.
Primers for quantitative real-time PCR (RT-qPCR) reactions (Supplementary Table 1) were synthesized by Sigma® (Rpl10, Rpl18, Sid1, and C3pO) or Exxtend® (Ago1, Ago2, Dcr1, Dcr2, and R2d2) and used in amplification reactions performed in a QuantStudio 3® system (Applied Biosystems, Waltham, MA, USA). Amplification data were analyzed using LinRegPCR, removing potential biases from the sample data (Ramakers et al. 2003). Fold Change (FC) of the relative gene expression was quantified using the 2^−ΔΔCT^ method (Livak and Schmittgen 2001). The ΔC_T_ values of pupa samples were compared with larva/male/female samples from three biological experiments. Statistical significance was calculated with two-way ANOVA, followed by Tukey's post-hoc test (P-value < 0.01).
Results and discussion
Genome assembly and annotation
Blobtools analysis of our preliminary meta-genomic assembly identified only a handful of contigs (N = 21) belonging to phyla other than Arthopoda (Supplementary Fig. 2). Of those identified as Arthropoda, 600 were found to correspond to classes other than Lepidoptera (Supplementary Figs 2 and 3). Removing all non-Lepidoptera reads did not meaningfully reduce the low-coverage kmers in our dataset (Supplementary Fig. 4), suggesting that they were the result of pooling multiple genetically heterogeneous individuals.
Assembling only the Lepidoptera-identified reads with hifiasm produced an assembly that was free of contamination (Supplementary Fig. 5) yet highly duplicated (>90% of odb10_lepidoptera BUSCOs; Fig. 2). Five iterations of purge_dups were able to reduce the duplicated BUSCO rate to 10.5% (Fig. 2 and Supplementary Fig. 6), consistent with previous studies that found it suitable for reducing spurious duplication in pooled-sample assemblies (Goldberg et al. 2024); however, this final duplication rate is still much higher than expected indicating that some haplotigs remain in our final assembly. This is further suggested by the presence of some low coverage contigs (<10×; Supplementary Fig. 5).
Results of BUSCO analysis for the raw assembly produced by hifiasm (top bar), the final assembly after 5× rounds of purge_dups and polishing with inspector (middle bar), and the structural annotation produced by helixer (bottom row).
Our final assembly remained highly fragmented by the standards of modern long-read sequencing (number of contigs = 1615; largest contig = 6.3Mb; total size = 370.6Mb; Table 1, Supplementary Table 2) and slightly smaller than our kmer estimate (399.7 Mb; Supplementary Fig. 4); however, it is important to note that our estimate is likely inflated by the presence of multiple haplotypes in our dataset and the true genome size is much smaller than our total assembly size. Indeed, the size of our genome assembly is larger than our flow cytometry measurement (294.6225 Mbp; mean nuclear value = 2C = 0.603pg) consistent with the presence of haplotypic duplication. Both our flow cytometry measurement and the size of our assembly are well within the range expected based on other lepidopteran genomes (e.g. Manduca sexta = 470 Mb, Gershman et al. 2021; Plutella xylostella = 323 Mb, Boyes et al. 2023; Pieris rapae = 246 Mb, Shen et al. 2016).
Inspector, which assesses genome quality and polishes errors by mapping raw reads to the assembly, found that our assembly was error-prone (Nsmall-scale errors = 2.03M; 5,482 per Mb; Supplementary Table 1), which is typical of pooled-sample assemblies (Goldberg et al. 2024), yet the polishing algorithm was able to significantly reduce this (Nsmall-scale errors = 190k; 512 per Mb). Despite these errors, BUSCO analysis found that most gene content was still represented (BUSCO complete = 91.6%; Table 1). This was further supported by high mapping rates for our six RNA-seq samples (mean mapping rate = 86%; Supplementary Table 3). RepeatMasker found that 46.44% of our assembly was composed of repetitive elements (Supplementary Table 4), predominantly retroelements (20.91% of total length) and unclassified elements (18.14% of total length); however, our aggressive purging of duplicate haplotigs is likely to have collapsed repetitive regions of the genome, as evidenced by the presence of some very high coverage contigs (>100×; Supplementary Fig. 5), thus the real repeat content of the L. coffeella genome may be higher.
Helixer was able to annotate 17,467 genes in our assembly, with 91.2% of protein BUSCOs complete; 13,725 of these genes were able to be functionally annotated. This speaks to the ability of helixer to identify lepidopteran genes, and is likely due to the over-representation of holometabola (especially lepidopteran and dipteran) genomes within its training corpus (details regarding Helixer training data sets can be found at Supplemental Figure S20 in Holst et al. 2025).
Identification and analysis of RNAi machinery proteins
Eggnog-mapper revealed several putative gene sequences associated with small RNA processing and transport pathways in the CLM genome (Supplementary Fig. 8): Double-stranded RNA-specific endoribonuclease (Dcr), Argonaute RISC Component 1 (Arg), Systemic RNAi-deficient-1, RNAi regulator-component 3 promoter of RISC (C3po), and dsRNA-binding protein R2D2 (R2d2). Additionally, we performed homology-based molecular modeling of selected RNAi core protein candidates to validate genomic predictions by elucidating conserved structural and functional features. Our results showed that the predicted proteins LcDCR1, LcDCR2, LcAGO1, LcAGO2, SID1, C3PO, and R2D2 (Fig. 3 and Supplementary Fig. 8) retain not only sequence homology but also their expected conformational function (Davis-Vogel et al. 2018; Arraes et al. 2021). Moreover, in order to assess the fit of the predicted genes to their models, we compared the L. coffeella sequences and structures to other lepidopteran insects, being Plutella xylostella the closest well studied relative within the same superfamily (Yponomeutoidea), and the more distantly related Manduca sexta (Bombycoidea) and Spodoptera frugiperda (Glossata), besides Drosophila melanogaster as an outgroup. We observed two distinct patterns on structural superposition: (i) highly conserved at the critical catalytic and functional domains, pinpointed at SID1 (Supplementary Fig. 9a and b) and DCR (Supplementary Fig. 9c and d); and (ii) variables regions at the flexible structural segments that potentially reflect recent evolutionary pressures exampled by LOQ (Supplementary Fig. 9e and f) and C3PO (Supplementary Fig. 9k and l). Structural analysis revealed that SID1, DCR, and AGO (Supplementary Fig. 9i and j) as the most conserved proteins across species. In contrast, HEN1 (Supplementary Fig. 9m and n) and R2D2 (Supplementary Fig. 9g and h) showed moderate conservation, while C3PO and LOQ were the least similar.
Graphical representation of the main genes in the exogenous dsRNA processing pathway involved in siRNA formation for gene silencing. AlphaFold3-predicted structures of key RNAi proteins found in Leucoptera coffeella: LcSID1 (ocher), LcDCR2 (magenta), LcLOQ (turquoise), LcR2D2 (dark green) LcAGO2 (orange), LcC3PO (light green), LcHEN (dark blue) all modeled from the L. coffeella genome. General representation of substrate molecules: dsRNA—PDB 1EBQ (multicolored), siRNA (red), mRNA (blue).
Although RNAi is a conserved mechanism in eukaryotes, variations in gene duplication, deletion, and expression across insect species affect its efficiency (Cooper et al. 2019). RNAi efficacy is influenced more by the expression levels of key enzymes—such as Dicer, Argonaute, and RISC components—than by the sheer number of core genes. The reported high variability in RNAi responses among insects suggests that analyzing the expression of RNAi-related genes could enhance our understanding of interspecies differences in RNAi efficiency and help identify key factors influencing silencing success (Koo et al. 2024).
The expression patterns of larvae, male and female compared to the pupal stage are depicted in Supplementary Fig. 7. Our qPCR results show that even candidate genes coding for closely related proteins (i.e. DCR2 and AGO2) presented different melting temperature of the amplicon and expression profiles, suggesting that the mined sequences were distinct enough to be assigned to L. coffeella as LcDcr1, LcDcr2, LcAgo1, and LcAgo2.
There are several fascinating findings within our analysis of dsRNA processing in L. coffeella. Firstly, we detected R2D2—a dsRNA-binding protein which forms a complex with Dicer to process dsRNA into siRNA thereby initiating RNAi through the loading of siRNA duplexes onto ARG proteins (Hameed et al. 2024). R2d2 has not been detected in several lepidopteran transcriptomes, suggesting a possible loss or reduced function of this gene in the order. In Bombyx mori, the R2D2 homolog is expressed at very low levels (Swevers et al. 2011). Interestingly, our study revealed LcR2d2 has a higher expression in males compared to larva, pupa and female. Although R2D2's role in the RNAi pathway is well established, its involvement in male-specific traits or processes remains largely uncharacterized. There is currently no information about the implication of R2D2 in male-specific development or reproduction. Second, LcSid1—a transmembrane protein involved in double-stranded RNA uptake (Saakre et al. 2024), exhibited an overexpression peak in adult males, with mean fold change values significantly higher than those observed in immature stages and females. Conversely, expression analyses in Ostrinia nubilalis showed that R2d2 transcripts were present across all developmental stages, with expression levels significantly lower in males than in females (Cooper et al. 2021). Also, in contrast to L. coffeella, Sid-1 expression in Nilaparvata lugens was lower in males compared to females (Zha et al. 2011). Altogether, these results suggest that RNAi pathways may be involved in sex-specific regulatory dynamics that warrant further investigation, especially when determining possible targets for novel biopesticides.
Concluding remarks
In light of rapid environmental changes and biodiversity loss, especially within the highly diverse tropics, the need to protect agriculture with novel pest management solutions has never been greater. RNAi-based biopesticides are a promising solution but require a substantial amount of research into the biology of target species to develop. By generating a draft genome assembly with PacBio long-reads, we have taken a crucial first step toward the goal of protecting coffee production from its most damaging pest, the CLM. Furthermore, we characterized the structures and expression of key RNAi pathway genes within our genome and confirmed the presence of RNAi machinery within L. coffeella. This establishes targeted biopesticides as a solution worthy of continued research and development. Notably, the serendipitous discovery of sex-biased expression in these genes points to previously undiscovered roles for RNAi in lepidopteran biology. As such, our study—and the underlying data—will serve as a valuable resource for future research, both fundamental and applied.
Supplementary Material
jkag003_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abramson J et al 2024. Accurate structure prediction of biomolecular interactions with Alpha Fold 3. Nature. 630:493–500. 10.1038/s 41586-024-07487-w.38718835 PMC 11168924 · doi ↗ · pubmed ↗
- 2Arraes FBM et al 2021. Dissecting protein domain variability in the core RNA interference machinery of five insect orders. RNA Biol. 18:1653–1681. 10.1080/15476286.2020.1861816.33302789 PMC 8582983 · doi ↗ · pubmed ↗
- 3Barathi S, Sabapathi N, Kandasamy S, Lee J. 2024. Present status of insecticide impacts and eco-friendly approaches for remediation-a review. Environ Res. 240:117432. 10.1016/j.envres.2023.117432.37865327 · doi ↗ · pubmed ↗
- 4Benoit G et al 2024. High-quality metagenome assembly from long accurate reads with meta MDBG. Nat Biotechnol. 42:1378–1383. 10.1038/s 41587-023-01983-6.38168989 PMC 11392814 · doi ↗ · pubmed ↗
- 5Blum M et al 2021. The Inter Pro protein families and domains database: 20 years on. Nucleic Acids Res. 49:D 344–D 354. 10.1093/nar/gkaa 977.33156333 PMC 7778928 · doi ↗ · pubmed ↗
- 6Camacho C et al 2009. BLAST+: architecture and applications. BMC Bioinform. 10:1–9. 10.1186/1471-2105-10-421.PMC 280385720003500 · doi ↗ · pubmed ↗
- 7Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. 2021. egg NOG-mapper v 2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 38:5825–5829. 10.1093/molbev/msab 293.34597405 PMC 8662613 · doi ↗ · pubmed ↗
- 8Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. 2020. Blob Tool Kit—interactive quality assessment of genome assemblies. G 3 Genes|Genomes|Genetics. 10:1361–1374. 10.1534/g 3.119.400908.32071071 PMC 7144090 · doi ↗ · pubmed ↗
