Methods for safely sharing dual-use genetic data

Sterling Sawaya; Chien-Chi Lo; Po-E Li; Blake Hovde; Patrick Chain

PMC · DOI:10.3389/fmicb.2026.1716431·February 11, 2026

Methods for safely sharing dual-use genetic data

Sterling Sawaya, Chien-Chi Lo, Po-E Li, Blake Hovde, Patrick Chain

PDF

Open Access

TL;DR

This paper introduces methods to securely share genetic data by obscuring sensitive details, preventing malicious use while preserving broad genomic insights.

Contribution

A novel method for obfuscating raw sequence data by pooling reads to prevent reconstruction of individual samples.

Findings

01

Pooling reads from multiple samples prevents full reconstruction of any individual sample.

02

Genomic information remains usable at a broad scale while fine-scale details are obscured.

03

Regions of a genome can be selectively removed to further restrict access to sensitive data.

Abstract

Some genetic data has dual-use potential. Sharing pathogen data has shown tremendous value. For example therapeutic development and lineage tracking during the COVID pandemic. This data sharing is complicated by the fact that these data have the potential to be used for harm. The genome sequence of a pathogen can be used to enable malicious genetic engineering approaches or to recreate the pathogen from synthetic DNA. Standard data security methods can be applied to genetic data, but when data is shared between institutions, ensuring appropriate security can be difficult. Sensitive data that is shared internationally among a wide array of institutions can be especially difficult to control. Methods for securely storing and sharing genetic data with potential for dual-use are needed to mitigate this potential harm. Here we propose new methods that allow genetic data to be shared in a…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species4

Homo sapiens(human · species)Severe acute respiratory syndrome coronavirus 2(no rank)Bacillus anthracis(anthrax bacterium · species)Monkeypox virus(no rank)

Diseases4

C-CL Influenza COVID infectious disease

Figures5

Click any figure to enlarge with its caption.

Workflow for methods used to obfuscate genomic data. (A) Method for pooling reads to generate a secure FASTQ file. FASTQ files were pooled and their order randomized. (B) Method for removing genomic region(s) to limit information within a set of (pooled) samples. Samples, or a pool of samples from panel (A), are aligned to an incomplete reference genome to produce a file that lacks complete genomic information.

Prediction of variants present for each sample and for the pooled datasets. For each group examined, Freyja was used to predict variants present within each individual sample (left stacked histogram in each subplot), and the proportion of each variant in the pooled sample represents the ratio of reads from that sample used in the pooled dataset. The right stacked histogram for each subplot represents the variants present within the pooled datasets as predicted by Freyja.

Prediction of lineages present within samples and within the pooled datasets. For each group examined, Freyja was used to predict lineages present within each individual dataset (left stacked histogram in each subplot), and the proportion of each lineage in the pooled dataset represents the ratio of reads from that dataset used in the pooled dataset. The right stacked histogram for each subplot represents the lineages present within the pooled datasets as predicted by Freyja. Only higher frequency variants are shown and as the pool grows larger there are too many variants to accurately plot. These are plotted as “other” and represent a larger portion of the pool as the number of datasets grows.

Pooled B. anthracis 2-samples SNVs and individual sample SNVs. (A) Venn diagram (https://bioinfogp.cnb.csic.es/tools/venny/). The variants present within two individual accessions (SRR5811050 and SRR13950249) were compared to determine which variants overlap between the different accessions and also between the variants called in the pooled accessions. (B) Heatmap. Y-axis represents SNV position clustering. Heatmap color represents the SNVs ratio which is defined as the alternate base versus the reference base ratio.

Pooled MPXV 2-samples SNVs and individual sample SNVs. (A) Venn diagram (https://bioinfogp.cnb.csic.es/tools/venny/). The variants present within two individual accessions (SRR30659182 and SRR21205526) were compared to determine which variants overlap between the different accessions and also between the variants called in the pooled accessions. (B) Heatmap. Y-axis represents SNV position clustering. Heatmap color represents the SNVs ratio which is defined as the alternate base versus the reference base ratio.

Keywords

cyberbiosecurityDNA sequencing securitygenome sequencinginformation securitypathogen sequencing

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCRISPR and Genetic Engineering · Law, AI, and Intellectual Property · Genetics, Bioinformatics, and Biomedical Research

Full text

Background

As biotechnology advances, there is an increased potential that it can be misused (Smith and Sandbrink, 2022). Today, methods in synthetic biology can facilitate the creation of novel organisms. DNA can be accurately synthesized and this DNA can be inserted into an organism using CRISPR/CAS9 (Doudna and Charpentier, 2014). Entire portions of a genome can either be directly synthesized or be rewritten to become any desired sequence. Furthermore, entire organisms can be constructed with synthetic DNA (Gibson et al., 2010). The ability to fully reconstruct a virus or bacteria with only a genome sequence has been demonstrated (Mitchell and Ellis, 2017). Although today this method is more cumbersome and expensive than simply adding or deleting genes from a genome, as synthetic biology continues to advance, these capabilities will become easier and more widespread.

The ability to recreate an organism with genetic data introduces a dilemma for those generating pathogen genome data. There is a need to share genetic information of pathogens so that their origin and evolution can be understood, and also a beneficial use of the genetic information when designing countermeasures to an infectious disease (Baker et al., 2020; Neves et al., 2023). However, because this data can be misused, there may be hesitance to share genetic data that has been collected (Dos S Ribeiro et al., 2018), or perhaps even withhold publishing to avoid pressure to share the data. For genetic data from the most dangerous pathogens, sharing data may be prohibited in some countries, especially if the data is shared internationally (Edelstein and Sane, 2015; Edelstein et al., 2018).

There are international treaties for sharing specific pathogen data, such as the Pandemic Influenza Preparedness Framework (Thompson et al., 2006). This framework encourages the sharing of genomic data from influenza, but is, however, not legally binding and its enforceability has never been tested (Rourke and Eccleston-Turner, 2021). There are ongoing negotiations to expand this approach to other pathogen data, but in general the sharing of pathogen sequence data only occurs between countries through bilateral agreements (Rourke, 2019). The ownership of genetic data and the sharing of benefits or products from genetic data are governed by the Nagoya protocol, but this treaty requires a case-by-case bilateral negotiation to make specific arrangements (Buck and Hamilton, 2011).

The standards and regulations for data sharing during a world-wide crisis have been tested by the COVID pandemic. During the initial stages of the outbreak, sharing of pathogen sequence data was essential for tracking its spread, predicting how the virus may evolve, and developing countermeasures such as mRNA vaccines (Chiara et al., 2020; Rockett et al., 2020). Without the prompt sharing of accurate sequence data, the world may not have been able to respond rapidly to this novel disease. Nevertheless, as data was openly shared, concerns about equity and benefit-sharing arose (Pratt and Bull, 2021; Shu and McCauley, 2017). These and other concerns can restrict fully open pathogen genomic data sharing during disease outbreaks.

The open sharing of scientific data stands as an ideal standard on which major scientific advances rely (Huston et al., 2019; Murray-Rust, 2008). However, this data sharing approach can conflict with the data sharing restrictions in place for the genomic data of certain pathogens. Currently, the standard method by which critical data from dangerous pathogens can be shared involves data sharing platforms with various levels of access control. While frameworks like GISAID (Shu and McCauley, 2017) and the PHA4GE Data-Sharing Accord (Griffiths et al., 2022) facilitate managed sharing by protecting submitter rights, and INSDC prioritizes “free and unrestricted access” (Karsch-Mizrachi et al., 2025), none of these platforms technically prevent the reconstruction of dangerous pathogens by actors with legitimate access. Consequently, pathogen data is either siloed, so no genetic data is shared, or the data is wholly shared, allowing anyone with access to the data to recreate the pathogen (Vinatzer et al., 2019). New, safer methods for accessing the sequence data of pathogens are needed. The methods proposed here provide a bridge between fully replicable data and data that remains entirely hidden in silos, enabling data sharing for legitimate research purposes while preventing pathogen reconstruction.

The methods here are inspired by techniques in molecular cryptography, in which cryptographic protocols are applied directly to DNA molecules to ensure that sensitive information is protected before the DNA is sequenced. Molecular cryptography can be used to mix DNA samples in such a way that a key is needed to determine which read from a DNA sequencer belongs to which sample (Sawaya, 2017). Without the key, data from the resulting pool is obfuscated. A similar approach can be done with software, in which reads from multiple files are pooled together and the meta-data is stripped. This approach utilizes the pooling breakthrough that allows high-throughput methods to scale next-generation sequencing but here we intentionally do not allow the data in the pool to be directly attributed to an individual sample. Using software, we examine the use of this approach. To produce output that is similar to molecular cryptography, our software adds a random string of nucleotides (the keys) to the sequence identifier line in the FASTQ file. The software also strips the sequence identifier line of any previous information and produces a key file that contains the keys tied to their original sequence identifier line and FASTQ file name. Here, we propose that this new method allows for sharing genetic sequence data while protecting sensitive aspects of the data.

Materials and methods

We developed new methods for formatting genetic data that prevents anyone with access to the data from being able to reconstruct an entire genome (Figure 1). The first proposed method involves the following (Figure 1B): DNA sequencing reads from a sample (or group of samples) are aligned to an incomplete reference genome. A substantial portion of this genome must be removed (e.g., removal of a full gene). The resulting alignment file (e.g., in SAM format) can then be processed into a variant call file (VCF) or can be shared directly. The aligned file or the VCF will be insufficient to recreate the full genome of the sample, as sensitive sequence regions that were removed are not included in the alignment of assembly files.

Workflow for methods used to obfuscate genomic data. (A) Method for pooling reads to generate a secure FASTQ file. FASTQ files were pooled and their order randomized. (B) Method for removing genomic region(s) to limit information within a set of (pooled) samples. Samples, or a pool of samples from panel (A), are aligned to an incomplete reference genome to produce a file that lacks complete genomic information.

Our second proposed method is more sophisticated, and can occur prior to the application of the first proposed method (Figure 1A). Here, we pool DNA sequence reads from multiple FASTQ datasets that originate from DNA samples of the same organism and the metadata from each read is stripped. The order of the reads in the pooled data is then randomized such that reads that originate from an individual sample are distributed throughout the pool. The resulting file remains in FASTQ format and contains all of the reads from the data of the original samples but paired-end read information is lost. The files with pooled reads can be directly processed into VCF files by mapping reads to a reference genome and calling variants. For additional security, the pooled data can also be processed with the first proposed method (aligning to an incomplete reference genome). This process also strips sample information from individual reads, adds a random identifier, and creates a separate mapping file (key file) that can be used to reverse the process. The software implementation of this pooling method is publicly available at https://github.com/Geneinfosec-Inc/ReadMixer.

To examine the read pooling method, we applied it to FASTQ data from SARS-CoV-2, from Monkeypox virus (MPXV), and from Bacillus anthracis. For SARS-CoV-2, we generated 7 independent datasets in FASTQ format, consisting of a unique mixture of 2, 5, 10, 50, 100, and 500 sequencing datasets obtained from CDC (see Supplementary Table 1). For B. anthracis we generated 4 distinct datasets, consisting of 2, 5, 10 and 50 isolates, randomly subselected from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). (see Supplementary Table 2) with search criteria “Organism = Bacillus anthracis” and “platform = Illumina” and “layout = paired.” Four distinct Monkeypox virus (MPXV) datasets were generated by pooling 2, 5, 10, and 50 isolates, each randomly selected from the 2,119 NCBI SRA samples (see Supplementary Table 3).

The metadata from each dataset, including the depth of coverage and strain variants/lineages, were utilized as the ground truth for the composition/abundance of each pooled dataset. For SARS-CoV-2, the EDGE COVID-19, build 20230131 (Lo et al., 2022), was used for variant calling and lineage assignment was done with Pangolin (v4.3.1) (O’Toole et al., 2021). For B. anthracis, the EDGE, build 20231115 (Li et al., 2017) was used for read-mapping to the reference genome Bacillus anthracis str. “Ames Ancestor” (AE017334.2) and for variant calling.

Freyja v1.4.8 (Karthikeyan et al., 2022) was used for deconvoluting the variants/lineages in the datasets from the VCF files from the pooled SARS-CoV-2 data. This program was designed to deconvolute genomes for “mixed” samples, with the ability to predict the SARS-CoV-2 lineages present within the mixed samples. We used Freyja to predict the proportion of SARS-CoV-2 variants and lineages in the datasets, for both pooled data and for individual datasets. We also bootstrapped the reads from the pool of 2 and 10 datasets, with 1000 replicates, and ran Freyja on these bootstrapped pools to predict strain variants.

To examine the pooled B. anthracis and MPXV data, we generated variant frequency heatmaps using the Plotly dash-bio.Clustergram package (Plotly, 2024). The Euclidean pairwise distance (Dokmanic et al., 2015) and complete linkage clustering algorithms (Großwendt and Röglin, 2017) were employed to the variant calling result based on variant locations. By applying these algorithms to the variant calling results, we can identify groups of variants with similar frequency patterns across different genomic locations. This clustering approach helps to identify, visualize, and interpret the relationships between variants in the B. anthracis and MPXV data, potentially revealing insights into genetic similarities or differences among the different treatments.

Results

By examining pooled data for larger and larger pools of sequencing datasets (FASTQ reads), we tested the ability of our pooling method to obfuscate genomic data using established bioinformatics methods. We estimated the composition of the pooled data using these methods and compared the results to the true composition based on the samples used to generate those pools.

For the SARS-CoV-2 samples, the EDGE COVID-19 workflow was used to provide majority-rule based variant calls (Lo et al., 2022). Consequently, regardless of whether the SARS-CoV-2 reads were pooled or whether datasets were analyzed separately, the same single nucleotide polymorphism variants (SNVs) and insertion/deletion (INDELs) were found. Freyja, another tool designed for SARS-CoV-2 (Karthikeyan et al., 2022), provides a population level analysis of the pooled reads. Freyja was successfully able to estimate the approximate proportion of variants present within the pooled data (Figure 2). Note that the samples used to generate the pool of 5 datasets did not contain any Delta variant, being entirely composed of the Omicron variant, so a comparison of variant calling was not possible with this group.

Prediction of variants present for each sample and for the pooled datasets. For each group examined, Freyja was used to predict variants present within each individual sample (left stacked histogram in each subplot), and the proportion of each variant in the pooled sample represents the ratio of reads from that sample used in the pooled dataset. The right stacked histogram for each subplot represents the variants present within the pooled datasets as predicted by Freyja.

Freyja demonstrated a slight bias when predicting the fraction of the variant in the pool, with a tendency to estimate a slightly higher fraction of the common variant. We bootstrapped the pool of 2 and 10 datasets for 1000 replicates to examine the variance in this trend. We found this trend also existed in the bootstrapped pools. The mean fraction of Delta in the pools of 2 and 10 were 0.84 and 0.66, while the bootstrapped values were 0.856 ± 0.007 and 0.68 ± 0.01, respectively.

While variants were accurately detected by Freyja, the lineages identified were not always correct (Figure 3). The prediction of lineages in Freyja produced primarily false positives, even when only two datasets were pooled. Freyja’s ability to predict the presence of lineages in pooled data had diminishing accuracy as more datasets were pooled together. Notably, Freyja tended to overestimate the number of lineages in the pooled data, and this overestimate increased as more datasets were pooled together.

Prediction of lineages present within samples and within the pooled datasets. For each group examined, Freyja was used to predict lineages present within each individual dataset (left stacked histogram in each subplot), and the proportion of each lineage in the pooled dataset represents the ratio of reads from that dataset used in the pooled dataset. The right stacked histogram for each subplot represents the lineages present within the pooled datasets as predicted by Freyja. Only higher frequency variants are shown and as the pool grows larger there are too many variants to accurately plot. These are plotted as “other” and represent a larger portion of the pool as the number of datasets grows.

While the community is familiar with delineating variants in viral outbreaks, the definition of a lineage within bacterial species is not as straightforward. Due to the lack of sublineage definitions, and also, since there exists no similar program to Freyja for bacteria, we took a different approach to demonstrate that our method provides obfuscation of bacterial genomes.

For B. anthracis we generated 4 distinct datasets, consisting of 2, 5, 10 and 50 isolates, for a total 67 SRA sequencing datasets (accessions for whole genome sequences). When the 67 individual sequencing datasets of B. anthracis were individually mapped to the reference genome, an average genome coverage ranging from 97.04% to 99.06% was achieved. The variant calling analysis revealed a SNV count ranging from 8 to 2,653. To assess the pooled B. anthracis data, we compared the SNV positions in the pools with those from individual sequencing datasets. Figure 4A presents a Venn diagram illustrating the SNVs found in the pooled 2-sequencing datasets relative to those found in the individual sequencing datasets alone (SRR5811050 with a mean depth 54× and SRR13950249 with a mean depth 203×). The pooled sequencing datasets share 1,115 SNVs with the individual sequencing datasets. However, the pooling of two sequencing datasets introduced three false positive SNVs that were not present in the individual sequencing datasets analyses. These false positives occurred because, although the alternative base ratios in the individual sequencing datasets were below the detection threshold at these positions, their combined ratios in the pooled datasets exceeded the threshold and were detected by the variant calling algorithm. Conversely, each individual dataset contains 1,307 SNVs that were not detected in the pooled data, representing a high false negative (FN) rate in the pooled analysis. This high rate is likely due to two factors: (1) pooling increased the depth coverage of non-variant regions, diluting the signal of true variants (1,196 FN in SRR13950249 and 53 FN in SRR5811050), and (2) the pooling method removed paired-end information, preventing the mapping algorithm from rescuing reads that did not perfectly align due to SNVs (58 FN shared by two individual datasets). Figure 4B depicts a cluster analysis heatmap of the pooling of 2 datasets, along with data from the two individual datasets. The vertical dendrogram on the left of Figure 4B illustrates the hierarchical clustering of the datasets, revealing patterns of similarity. The heatmap itself shows variations in alternate base ratios across different SNV positions, with color gradients indicating the frequency of these variants. As seen in Figure 4B, although the SNV positions in the pooled datasets overlap with those in the individual datasets, there are differences in the alternative base ratios for other SNV positions. Notably, the top cluster highlights a “dilution” effect for false negative SNVs, where pooling reduces variant signals below detection thresholds. This result demonstrates that combining reads from different datasets can obscure variant information, leading to both false positives and false negatives.

Pooled B. anthracis 2-samples SNVs and individual sample SNVs. (A) Venn diagram (https://bioinfogp.cnb.csic.es/tools/venny/). The variants present within two individual accessions (SRR5811050 and SRR13950249) were compared to determine which variants overlap between the different accessions and also between the variants called in the pooled accessions. (B) Heatmap. Y-axis represents SNV position clustering. Heatmap color represents the SNVs ratio which is defined as the alternate base versus the reference base ratio.

The Monkeypox virus analysis further demonstrates how pooling reads effectively obscures sample-specific genomic signals in organisms whose genome sizes fall between SARS-CoV-2 (∼30 kb) and bacteria (∼5 Mb). MPXV’s 197-kb-long double-stranded DNA genome provides an intermediate test case, enabling evaluation of how pooling behaves for large viral genomes that exhibit moderate genomic complexity relative to RNA viruses. As shown in the 2-sample Venn diagram (Figure 5A), all variants detected in the pooled sample were also present in the contributing isolates. Notably, the pooled dataset contained no SNVs unique to the pool, indicating that, unlike the B. anthracis case, the pooling procedure did not generate false-positive variants that exceed detection thresholds. This behavior is consistent with MPXV’s relatively low mutation rate and more homogeneous variant landscape (Scarpa et al., 2022), which reduce the likelihood of low-frequency alleles combining to form artifact variants.

Pooled MPXV 2-samples SNVs and individual sample SNVs. (A) Venn diagram (https://bioinfogp.cnb.csic.es/tools/venny/). The variants present within two individual accessions (SRR30659182 and SRR21205526) were compared to determine which variants overlap between the different accessions and also between the variants called in the pooled accessions. (B) Heatmap. Y-axis represents SNV position clustering. Heatmap color represents the SNVs ratio which is defined as the alternate base versus the reference base ratio.

Despite the absence of pool-specific SNVs, the heatmap and clustering profile (Figure 5B) show that variant-frequency patterns in the pooled MPXV sample are substantially altered relative to the individual isolates. The pooled profile does not cluster tightly with either isolate and shows clear dilution of alternate-base ratios, demonstrating that quantitative variant signals are still disrupted by the pooling process. As a result, although the pool does not introduce artificial mutations, it nonetheless obfuscates isolate-specific genomic information and impedes accurate reconstruction of individual MPXV genomes. These results confirm that pooling-based data security is effective across genome sizes from small RNA viruses to large DNA viruses and bacteria, even when variant counts are limited and false positives do not arise.

Discussion

The data generated by our software has a similar format to a real-life scenario in which the read data from individual samples is not clear. Samples can contaminate each other, multiple variants of a disease can co-infect a patient, or a new, recombinant lineage can arise. The program Freyja is designed to disentangle these potential scenarios for SARS-CoV-2, making it an ideal approach by which we can test the limits of our method. Freyja was unable to accurately predict which SARS-CoV-2 lineages were present within our pooled data, indicating that our method successfully obfuscates lineage information (Figure 3). Consequently, our pooling method can prevent the reconstruction of complete SARS-CoV-2 genomes, even when only a few genomes are pooled. Freyja’s ability to accurately estimate the fraction of variants in our pooled data, on the other hand, indicates that our method does not conceal lineage-specific information (Figure 2). Recent work has shown that pooled whole-genome sequencing is an effective and inexpensive strategy for SARS-CoV-2 surveillance (Park et al., 2025). Our results demonstrate that this same pooling principle can be leveraged for biosecurity purposes, obfuscating individual genomes while preserving the population-level signal needed for surveillance.

When constructing genomes, many factors can influence variant calling. We expected that the differences between the bacterial and viral genomes examined here would influence how well our pooling method would obfuscate the genomes. Bacterial genomes are larger and more structurally complex than viral genomes and contain more variation, and thus the reconstruction of their genomes can be more difficult. Furthermore, delineation of bacteria lineages can be complicated by frequent horizontal gene transfer from distant species and large-scale genome rearrangements (Power et al., 2021; Smith and Andam, 2021). Therefore, we expected a priori that reconstructing B. anthracis genomes from the pooled sequence data would be a bigger challenge than for SARS-CoV-2. Without a custom tool like Freyja to disentangle genomes, our analysis instead focused on a comparison of variant calling between pooled data and individual datasets.

For data pooled from two B. anthracis datasets, the SNPs called in the pool did not always match the SNPs independently called from the two original datasets (Figures 4A, B). These results demonstrate that pooling data obfuscates variant calling. Accurate variant calling is needed to fully reconstruct genomes or to predict the presence of specific bacterial lineages. This effect becomes more pronounced as additional datasets are pooled (see Supplementary Figures 1–6), making it increasingly difficult to accurately call variants and fully reconstruct genomes. Note, however, that smaller regions of the genome, such as pathogenicity islands (Juhas et al., 2009), would be more readily reconstructed than whole genomes. This issue can be addressed by removing high-risk loci through an alignment to an incomplete reference genome (Figure 1B).

Similar to B. anthracis, the obfuscation from pooling MPXV datasets increases as the number of datasets increases (see Supplementary Figures 1–6). At 200 kilobases in size, the MPXV genome represents an intermediate between the small SARS-CoV-2 genome (∼50 kilobases) and the large, relatively stable genome of B. antracis (∼5 megabases). Our method’s ability to obfuscate the MPXV genomes demonstrates the effectiveness of our method on intermediate genome sizes and mutation rates, even when only 2 datasets were pooled. These results have practical applications. The MPXV recently caused a multi-country outbreak, prompting concerns about its pandemic potential. Consequently, methods that enable secure sharing of MPXV genomic data for surveillance while preventing misuse are of immediate public health relevance (Scarpa et al., 2022).

Overall, these results demonstrate that by pooling reads from the same species, the samples’ genomes become obfuscated. Not all information is concealed, however. Much of the genomic information remains: the species being sequenced, the common variants that are present in the pool, and general information about gene sequences. We also found that pooling more datasets always increases genome obfuscation, and larger genomes with more variants are more difficult to accurately reconstruct. However, the limits of this method have not been tested here. Pooling genomes of distantly related organisms will facilitate the attribution of a specific read to a specific lineage, limiting obfuscation of the sample’s genome. Furthermore, if two samples are nearly identical in their genome sequence, then pooling their sequence data will do little to change the overall composition of the pool. Additional work is required to determine exactly how sequence diversity, genome size, and the number of pooled datasets affect how information is concealed with our method. To address this uncertainty, aligning reads to an incomplete reference genome (Figure 1A) can always be used to ensure that full reconstruction of a genome is not possible.

This work can be expanded to build standardized workflows for using incomplete reference genomes, as well as building variant call files and obfuscated consensus genomes from the pooled datasets. Any downstream applications using our data format will have equal or better data obfuscation than the pooled datasets. These data can also be used on the available platforms for pathogen data sharing (Griffiths et al., 2022; Karsch-Mizrachi et al., 2025; Shu and McCauley, 2017), however some adaptations may be required. For example, downstream obfuscated data would need to be clearly demarcated to ensure that users of the platform are aware that the sequence data had been obfuscated prior to analysis. Additionally, future iterations of this method could use the key file provided by our software to map specific reads back to their original datasets. This would permit full sample data retrieval when authorized, mirroring the functionality of molecular cryptography. Integrating our methods into these established data sharing platforms would offer a secure pathway to share high-risk pathogen data.

Many other types of genetic data are widely shared. For example, human genomic data can be shared to further research on human genomics or to develop new pharmaceutical drugs (Alvarellos et al., 2023). The sharing of human genomic data is complicated by issues of identifiability of the data (Gymrek et al., 2013) and the consequent violations of privacy (Bonomi et al., 2020; Erlich and Narayanan, 2014; McGuire, 2008). Human genetic data also has potential security issues that result from the fact that the people from whom the data is generated will be alive for decades and future technological advancements in biotechnology and AI may facilitate the misuse of this data (Sawaya et al., 2012). Like with pathogen data, concerns about the insecure sharing of human genomic data may ultimately lead to limitations on how data is shared.

The application of our data security techniques to other types of genomics data, like human data, represents a potential future direction of this work. Although pools made from larger genomes may require specialized variant calling pipelines to avoid false negatives, pooling genetic data has been theoretically proven to increase privacy regardless of genome complexity (Sawaya, 2017). Numerous techniques have been developed for human genetic data privacy and security (Oliva et al., 2024), including methods of genomic sequence obfuscation that utilize sequence similarity (Wan and Wang, 2022) or variant masking (Hekel et al., 2021). Other privacy and security methods can be applied to summary statistics obtained from human genomic data (Kamm et al., 2013; Wan et al., 2017). In contrast, our method is designed to obfuscate and mask dual-use genomic data for biosecurity applications. Furthermore, by operating on raw sequence data (FASTQ), our method can be integrated with downstream security techniques that operate on genome sequences or variants.

Conclusion

The genomes analyzed here are of importance to biosecurity, and this method can have widespread applications to genomic data of other pathogens. This method can also be applied to larger genomes, such as human, to obfuscate sensitive data for privacy and security. Pooling genomic data, or equally, pooling genomic material, results in a mixture that prevents full reconstruction of a genome. This method, potentially combined with methods that remove some regions of the genome, can nevertheless allow useful genomic data to be shared for tasks such as population-level variant monitoring and pathogen surveillance (Park et al., 2025). The computational methods used here also have the potential to guide and optimize methods in molecular cryptography. The effectiveness of different pooling methods can be examined with this software, and the results can be used to design molecular cryptography.

By sharing raw data in a way that prevents misuse, analyses can be done to ensure an accurate interpretation of the data. The use of genetic data, in pathogen surveillance and elsewhere, can be essential for biosecurity, for the development of the bioeconomy, and the advancement of medicine. This data sharing can be stifled by the threat of misuse. The availability of raw data can help ensure the accuracy, authenticity and interpretation of any genomic analyses. Even when raw data is not to be shared, securing information within raw data itself provides security for the data at rest, and ensures that any information extracted from the data remains equally secure.

Bibliography46

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Alvarellos M. Sheppard H. E. Knarston I. Davison C. Raine N. Seeger T. (2023). Democratizing clinical-genomic data: How federated platforms can promote benefits sharing in genomics. Front. Genet. 13:1045450. 10.3389/fgene.2022.1045450 36704354 PMC 9871385 · doi ↗ · pubmed ↗
2Baker D. van den Beek M. Blankenberg D. Bouvier D. Chilton J. Coraor N. (2020). No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. P Lo S Pathog 16:e 1008643. 10.1371/journal.ppat.1008643 32790776 PMC 7425854 · doi ↗ · pubmed ↗
3Bonomi L. Huang Y. Ohno-Machado L. (2020). Privacy challenges and research opportunities for genomic data sharing. Nat. Genet. 52 646–654. 10.1038/s 41588-020-0651-0 32601475 PMC 7761157 · doi ↗ · pubmed ↗
4Buck M. Hamilton C. (2011). The nagoya protocol on access to genetic resources and the fair and equitable sharing of benefits arising from their utilization to the convention on biological diversity. Rev. Eur. Commun. Intern. Environ. Law 20 47–61. 10.1111/j.1467-9388.2011.00703.x · doi ↗
5Chiara M. D’Erchia A. M. Gissi C. Manzari C. Parisi A. Resta N. (2020). Next generation sequencing of SARS-Co V-2 genomes: Challenges, applications and opportunities. Brief Bioinform. 22 616–630. 10.1093/bib/bbaa 297 33279989 PMC 7799330 · doi ↗ · pubmed ↗
6Dokmanic I. Parhizkar R. Ranieri J. Vetterli M. (2015). Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Magazine 32 12–30. 10.1109/MSP.2015.2398954 · doi ↗
7Dos S Ribeiro C. Koopmans M. P. Haringhuizen G. B. (2018). Threats to timely sharing of pathogen sequence data. Science 362 404–406. 10.1126/science.aau 5229 30361362 · doi ↗ · pubmed ↗
8Doudna J. A. Charpentier E. (2014). Genome editing. The new frontier of genome engineering with CRISPR-Cas 9. Science 346 1258096–1258096. 10.1126/science.1258096 25430774 · doi ↗ · pubmed ↗