Highly Secure In Vivo DNA Data Storage Driven by Genomic Dynamics
Jiaxin Xu, Yu Wang, Haibo Zhou, Mingen Li, Yang Wang, Lingwei Wang, Hui Mei, Junbiao Dai, Shanze Chen, Xiaoluo Huang

TL;DR
This paper introduces a new method for secure DNA data storage in living organisms by using biological and computational systems to enhance encryption.
Contribution
The novel contribution is a unified paradigm called ICBP that uses genomic dynamics to expand encryption key space by over 100 orders of magnitude.
Findings
ICBP successfully encrypts, stores, and decrypts digital files in living systems with 100% data recovery after 100 generations.
The encryption method resists brute force and statistical attacks due to its use of dynamic code tables from gene regulatory networks or genomes.
Storing code tables in synthetic genes or genomes adds an additional layer of security through biological complexity.
Abstract
DNA is a promising medium for next‐generation data storage because of ultrahigh information density and stability. DNA storage within living organisms presents further advantages, such as self‐replication, compactness, and concealment. Early efforts primarily developed predetermined methods for encoding and decoding data using in vivo DNA sequences. However, these methods may pose a security risk while opening a clear channel for potential data access and breaches. To address these challenges, we propose a unified paradigm, integrated computational–biological programming (ICBP), by exploiting the intrinsic digital characteristics within computational and microbial systems. ICBP involves the construction of dynamic code tables from gene regulatory networks or complete genomes across diverse species, expanding the key space by more than 100 orders of magnitude compared with existing…
TABLE 1
| Image | Plain H (R) | Plain H (G) | Plain H (B) | Encrypted H (R) | Encrypted H (G) | Encrypted H (B) | Encrypted GVD (R) | Encrypted GVD (G) | Encrypted GVD (B) | Encrypted EQ (R) | Encrypted EQ (G) | Encrypted EQ (B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chest | 4.961 | 5.954 | 4.961 | 7.410 | 7.412 | 7.414 | 0.995 | 0.996 | 0.995 | 2171.25 | 1761.656 | 2168.516 |
| Cloud | 7.454 | 7.245 | 6.667 | 7.785 | 7.784 | 7.785 | 1.000 | 1.000 | 1.000 | 5472.422 | 6434.531 | 11369.16 |
| Orange | 7.532 | 7.734 | 5.854 | 7.959 | 7.781 | 7.967 | 0.716 | 0.738 | 0.747 | 826.984 | 954.188 | 1463.900 |
| Lung | 6.486 | 6.486 | 6.486 | 7.713 | 7.713 | 7.712 | 0.971 | 0.971 | 0.971 | 752.313 | 751.164 | 748.422 |
| Building | 7.757 | 7.754 | 7.812 | 7.970 | 7.970 | 7.970 | 0.947 | 0.947 | 0.948 | 10052.05 | 10192.41 | 7604.195 |
| Peppers | 7.253 | 7.594 | 6.968 | 7.979 | 7.980 | 7.980 | 0.976 | 0.967 | 0.978 | 831.27 | 640.41 | 994.73 |
| Reference | Microorganism | Bio‐strategy | Digital operation | Sequencing platform | Full recovery | Sequencing error rate | Sequencing coverage | Encryption strategy | Dynamic codewords | Encryption potential* |
|---|---|---|---|---|---|---|---|---|---|---|
| Chen et al. | | Artificial chromosome | Error‐correction coding, random interleaver, sparsification, XOR, transcoding | Nanopore sequencing | Yes | NM | ≥16.8× | NC | No | NC |
| Shipman et al. | | CRISPR array | Pixel‐value encoding, rigid and flexible strategy | Illumina MiSeq | No | NM | NM | NC | No | NC |
| Luo et al. | | Fluorescent protein gene | Run‐length compression, base‐4 conversion, randomized encryption | Sanger sequencing | Yes | NDE | Once | NC | No | NC |
| Sun et al. | | Recombinase‐based site‐specific genome engineering | RS code, RaptorQ code, tertiary RS repair | Sanger; noisy nanopore sequencing | Yes | NDE; ∼10% | Once; 603× | NC | No | NC |
| Hou et al. | | CRISPR‐Cas9 | Compression mapping, extension mapping | Sanger sequencing | Yes | NDE | Once | NC | No | NC |
| Ping et al. | | Plasmid construction and transformation | Yin‐Yang codec system | Sanger sequencing; DNBSEQ | Yes | NDE; NM | Once; NM for DNBSEQ | Codec rules | No | NC |
| Huang et al. | | — | Wukong codec | Sanger sequencing | Yes | NDE | Once | Codec rules | No | Limited |
| Zhang et al. | | Plasmid construction and transformation | Wukong codec | Sanger sequencing | Yes | NDE | Once | Codec rules | No | Limited |
| Yang et al. | | 34 phage integrases | NC | Sanger sequencing | NM | NDE | NM | NC | No | NC |
| Zhang et al. | | Plasmid transformation | LZW and RS encoding | Sanger sequencing; Illumina | Yes | NDE; ∼0.1% | Once; 300× | NC | No | NC |
| Liu et al. | | CRISPR‐Cas12a and phage‐derived recombinases | Base64, Huffman | Sanger sequencing; Illumina; MGIseq | Yes | NDE | Once for Sanger, NM for NGS | NC | No | NC |
| This work | | Gene regulatory network, entire genome, plasmid construction and transformation | Rossler system, dynamic code tables, Logistic map, Sine map, PWLCM map, XOR, Wukong codec | Sanger sequencing | Yes | NDE | Once | Genomic dynamics, chaos, codec rules | Yes | Significantly expanded encryption capacity |
- National Key Research and Development Program of China (10.13039/501100012166)
- National Natural Science Foundation of China (10.13039/501100001809)
- Shenzhen Medical Academy of Research and Translation
- Natural Science Foundation of Guangdong Province (10.13039/501100003453)
- Shenzhen Science and Technology Program (10.13039/501100017610)
- Shenzhen Clinical Research Center for Respiratory Disease
- Shenzhen Key Laboratory of Respiratory Diseases
- Innovation Program of Chinese Academy of Agricultural Sciences and Shenzhen Outstanding Talents Training Fund
- Major Project of Guangzhou National Laboratory
Topics: DNA and Biological Computing · Cellular Automata and Applications · Gene Regulatory Network Analysis
1 Introduction
Global data are projected to reach 300 zettabytes (ZB) by 2028 according to estimations by the International Data Corporation [1]. This exponential growth has driven the development of new data‐storage paradigms capable of overcoming the limitations of conventional magnetic and silicon‐based media [2]. Traditional storage technologies are constrained by their long‐term durability, scalability, storage density, and energy efficiency. To fulfill the demand for storing explosively accumulating data volumes, attention has been increasingly focused on biological systems that offer programmable density, stability, and sustainability.
DNA has emerged as a compelling candidate for next‐generation data storage owing to its ultrahigh storage density and efficiency, long‐term stability, and minimal energy consumption [3, 4, 5]. Recent advances in synthetic DNA encoding have enabled diverse digital files, ranging from text to video, to be converted into nucleotide sequences and stored in vitro. These systems exhibit storage densities exceeding those of electronic chips by six to seven orders of magnitude and offer the potential to preserve information for centuries [6].
Although in vitro DNA storage offers remarkable potential, recent efforts have expanded the frontier toward in vivo storage by embedding digital information directly into the genomes of living organisms [7, 8, 9]. This strategy provides not only a scalable storage system owing to its self‐replicating and self‐repairing features but also a new layer of concealment and integration with biological infrastructure [10]. Furthermore, the propagation of organisms could allow the records of human civilization to be transmitted across generations, offering a compelling vision for long‐term cultural or scientific preservation. Sensitive data, such as personal, medical, and intellectual property information, may be stored within living systems for secure preservation. Pioneering in vivo storage strategies generally fall into two categories: exogenous DNA integration, where synthetic DNA encoding digital data is inserted into the genome of biological hosts, and plasmid replication, which allows for the editing or retrieval of information within organisms. For example, Chen et al. introduced a 254‐kb artificial yeast chromosome for encoding two images and a video clip with a total size of 37,782 bytes [11]. Hou et al. developed a “Cell Disk” using CRISPR/Cas9 to support random reading/rewriting by embedding a “lock‐and‐key” into yeast cells [8]. Liu et al. reported a dual‐plasmid CRISPR‐Cas12a system enabling target‐specific DNA‐based information storage and processing [12]. However, these approaches generally do not consider high‐level data encryption and rely primarily on fixed encoding schemes to archive data. Although they can withstand biological noise, they offer limited protection against malicious attackers owing to their static nature.
DNA cryptography has emerged as a promising solution to address these security challenges. Previous studies have explored various encryption strategies using DNA's intrinsic characteristics, such as its mirror image version, molecular weight, and DNA origami geometries [13, 14, 15]. Other approaches involve incorporating unnatural bases to enable the bioorthogonal encryption of secret information [7]. Despite these innovations, most cryptographic frameworks are designed for in vitro conditions and rely on static code mapping. They often lack adaptability to biological complexity, are limited in key space and file type flexibility, and risk information leakage once the codebook is compromised [16, 17]. These constraints become particularly problematic when embedding sensitive information into living hosts, where the consequences of a breach could be both digital and biological.
To address these challenges, we propose integrated computational–biological programming (ICBP) (Figure 1a), a unified encryption framework that combines the logical precision of computer science with the evolutionary complexity of microbial systems to counter the escalating threat of data breaches through data encryption and storage across digital and biological platforms (Supplementary Note S1 and Figure S1) [18]. ICBP leverages chaotic cryptographic maps along with dynamic code tables derived from either gene regulatory networks or whole genomes. This hybrid system produces an exponentially large key space, offering enhanced resistance to brute force and statistical attacks (Supplementary Note S2 and Figure S2). Moreover, ICBP offers ultra‐secure, flexible encryption and storage of digital data within living cells without compromising cell viability, growth, or bulk replication. This work not only represents a significant step toward integrating digital security with biological information systems but also pioneers a transformative model for encryption rooted in the synergy of computation and biology, offering a compelling blueprint for data storage in the post‐silicon era.
FIGURE 1. Secure in vivo data storage by genomic dynamics. (a) Programming between computer and microbe for highly secure data encryption. Code tables were generated from gene networks or whole genomes of diverse species, including fungi, bacteria, and viruses. Codes were recorded on the computer in three forms for subsequent encryption applications. In the gene network‐based method, codes and their physical order were mapped to specific positions based on gene interactions and regulatory relationships. In the genome‐based method, codes and their genetic coordinates were annotated within the corresponding genome. Alternatively, codes were synthesized into DNA sequences flanked by specific markers. All information was recorded on the computer side for retrieval when needed. Created with Biorender.com. (b) Schematic flowchart of the ICBP encryption algorithm and ICBP‐assisted in vivo data storage. The ICBP encryption algorithm uses a combinatorial coding strategy for DNA‐based data storage, comprising (i) reading the pixel matrix of three channels from BMP images, (ii) scrambling pixels with the Rossler chaos system, (iii) flattening pixels into a one‐dimensional vector and converting the information into DNA sequences via combinatorial code table mapping, (iv) secondary scrambling of each DNA sequence via Logistic and Sine chaos mapping, (v) an XOR operation selected by the PWLCM function, and (vi) reconstructing an encrypted image from the manipulated pixels. ICBP‐assisted in vivo data storage involves (vii) encoding the encrypted image with the codec system (DNA encoding), (viii) synthesizing DNA, and (ix) in vivo DNA data storage.
2 Results
2.1 ICBP Encryption Designed for In Vivo DNA Data Storage
To ensure data security for in vivo systems, ICBP encryption was designed to be embedded in a traditional DNA data storage scheme (Figure 1). The critical step in ICBP is the cross‐programming between the computer and the microbe to computationally generate code tables, which involves reading and selecting codes from diverse genomic origins (Figure 1a; Figures S2–S11 and Supplementary Note S1) [19]. Essentially, these code tables serve as “dynamic codewords” for the whole encryption pipeline. A computational program facilitates species selection and determines whether the codes are extracted from gene regulatory networks or genomes. The code table extraction workflow from a gene regulatory network is shown in Figures S3–S5. For gene regulatory networks, relevant signal transduction or metabolic pathways were selected using the Kyoto Encyclopedia of Genes and Genomes database (https://www.genome.jp/kegg/) [20]. The genes and their sequences within the chosen network were retrieved (Figures S7–S10), and codes were extracted using a random number generation algorithm applied to these sequences. For whole genome–based extraction, codes were extracted similarly by applying pseudo‐random number generation to the complete genomic sequences (Figure S11). These codes formed the code tables required for data encryption. Additionally, the code tables constituted from different genomes could be adjusted to enhance encryption strength (Supplementary Notes S1 and S2).
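The extraction step described above can be pictured with a small sketch. It assumes, purely for illustration, that a code is a fixed-length substring read at a pseudo-random position of a sequence and that the generator seed is kept as part of the key. The function name `extract_codes`, the code length, the table size, and the toy sequence are all hypothetical and are not taken from the paper or its supplement.

```python
import random


def extract_codes(sequence, n_codes, code_len, seed):
    """Return (position, code) pairs read at pseudo-random positions.

    Hypothetical helper: the sampling rule and parameters are assumptions.
    """
    max_start = len(sequence) - code_len
    if max_start < 0:
        raise ValueError("sequence is shorter than one code")
    rng = random.Random(seed)  # the seed acts as part of the secret key
    picked, seen = [], set()
    for _ in range(100 * n_codes):  # bounded retries so the loop always ends
        start = rng.randint(0, max_start)
        code = sequence[start:start + code_len]
        if code not in seen:  # unique codes keep the table invertible
            seen.add(code)
            picked.append((start, code))
            if len(picked) == n_codes:
                return picked
    raise ValueError("not enough distinct codes in this sequence")


# Toy stand-in for a gene or genome sequence (not real biological data).
toy_genome = "".join(random.Random(7).choices("ACGT", k=2000))
table = extract_codes(toy_genome, n_codes=16, code_len=4, seed=2024)
```

Recording the positions alongside the codes mirrors the idea that a table can later be rebuilt from genomic coordinates rather than from the codes themselves.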
Once the code tables are generated, as depicted in Figure 1b, the encryption procedure begins by reading the pixel matrix of the BMP image file. A first scrambling operation then transforms the matrix into a scrambled three‐dimensional pixel matrix using the Rossler hyperchaotic system, which extends the original three‐dimensional Rossler chaotic system and has at least two positive Lyapunov exponents. The scrambled matrix is flattened into a one‐dimensional vector, and two DNA sequences, Sa and Sb, are generated by direct mapping with the combinatorial codes read from code tables obtained from gene regulatory networks and microbial genomes (Figure 1; Tables S1–S12 and Supplementary Note S2). This biologically inspired coding method, which differs fundamentally from purely computational algorithms, can markedly increase data randomness. A second scrambling operation is applied to the two sequences using Logistic and Sine chaos‐based mapping, respectively. This sandwich‐style design, which chains computer–biology–computer processes, can substantially improve the overall encryption performance of the algorithm. The resulting sequences, S'a and S'b, undergo a third operation, diffusion, using XOR rules (Table S12) randomly selected by the PWLCM. The XOR‐ed sequences are then converted back into pixels by reversing the encoding process shown in Figure 1biii, yielding pixel values that are diffused relative to the original flattened pixels. The diffused pixel values form an encrypted image, which is subsequently encoded using a publicly available codec system to generate DNA sequences that satisfy constraints such as sequence length, GC content, and homopolymer length. These DNA sequences are then inserted into microbial hosts for in vivo storage.
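A minimal sketch of the second scrambling and the XOR diffusion may help make these steps concrete. It is written under several assumptions that are not taken from the paper: the permutation is obtained by sorting a chaotic orbit, the XOR rules are the eight commonly used complementary DNA coding rules standing in for Table S12, Sb acts as the XOR partner of Sa, and all map parameters and initial values are arbitrary. The Rossler stage and the code table mapping are omitted.

```python
import math

# Eight commonly used complementary DNA coding rules; the index of a base in
# the string is its 2-bit value. Stand-ins for Table S12 (assumption).
RULES = ["ACGT", "AGCT", "CATG", "CTAG", "GATC", "GTAC", "TCGA", "TGCA"]


def logistic_orbit(x0, n, r=3.99, burn_in=200):
    """Logistic map x -> r*x*(1-x); the first burn_in values are discarded."""
    x, out = x0, []
    for i in range(burn_in + n):
        x = r * x * (1.0 - x)
        if i >= burn_in:
            out.append(x)
    return out


def sine_orbit(x0, n, a=0.99, burn_in=200):
    """Sine map x -> a*sin(pi*x); the first burn_in values are discarded."""
    x, out = x0, []
    for i in range(burn_in + n):
        x = a * math.sin(math.pi * x)
        if i >= burn_in:
            out.append(x)
    return out


def pwlcm_orbit(x0, n, p=0.3):
    """Piecewise linear chaotic map on [0, 1] with control parameter p < 0.5."""
    x, out = x0, []
    for _ in range(n):
        y = 1.0 - x if x >= 0.5 else x  # the map is symmetric about 0.5
        x = y / p if y < p else (y - p) / (0.5 - p)
        out.append(x)
    return out


def permutation(orbit):
    """Index order that sorts the orbit; used as a key-dependent shuffle."""
    return sorted(range(len(orbit)), key=orbit.__getitem__)


def scramble(seq, perm):
    return "".join(seq[i] for i in perm)


def unscramble(seq, perm):
    out = [""] * len(seq)
    for dst, src in enumerate(perm):
        out[src] = seq[dst]
    return "".join(out)


def xor_diffuse(sa, sb, selector):
    """XOR two base strings; each position uses the rule picked by selector."""
    out = []
    for a, b, x in zip(sa, sb, selector):
        rule = RULES[min(int(x * 8), 7)]
        out.append(rule[rule.index(a) ^ rule.index(b)])
    return "".join(out)


sa, sb = "ACGTTGCAAGCT", "TTGACCAGTCAG"  # toy sequences, not real data
perm_a = permutation(logistic_orbit(0.37, len(sa)))
perm_b = permutation(sine_orbit(0.52, len(sb)))
sa_scr, sb_scr = scramble(sa, perm_a), scramble(sb, perm_b)
selector = pwlcm_orbit(0.41, len(sa))
cipher = xor_diffuse(sa_scr, sb_scr, selector)
```

Because XOR under a fixed rule is its own inverse, calling `xor_diffuse(cipher, sb_scr, selector)` returns `sa_scr`, and `unscramble` then restores the original order. This is the sense in which decryption needs both the chaotic parameters and the tables.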
This design offers a highly secure in vivo DNA data storage strategy because its encryption procedure integrates computational chaos with biological complexity. The retrieval of the original information begins with PCR amplification and sequencing of the encrypted DNA sequences (Figure 2a), which are then decoded back into the encrypted image. Decryption requires reversing the combinatorial code table mapping and encryption processes, using both the computational parameters and the code tables stored in microbes. Because the code tables are both generated and stored in vivo, the way they are recovered depends on the preservation method. For synthetic DNA, code tables were organized into linear DNA sequences with specific marker sequences flanking each table. After synthesis, the sequences were inserted into plasmids and transformed into microbial hosts for storage. Sequencing allows for the retrieval of the utilized code table information. For regulatory network–based storage, codes in these tables, following their physical order, are mapped to specific positions within genes according to the gene interactions and relationships of the regulatory network. Tables were reconstructed by referencing the known gene interaction architecture and cataloged stimuli information on a computer (Table S2 and Figures S8–S10). In the genome‐based strategy, the entire genome of a species is regarded as a code–reference library (Supplementary Note S2.2.2). The codes used in the encryption tables were identified and marked within the genome, and their genetic positions were computationally documented. The gene names and positions in Tables S3–S5 were later used to reconstruct code tables when needed. Once the code tables are rebuilt, decryption of the encrypted image proceeds by reversing the process shown in Figure 1b.
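For the synthetic DNA route, recovering a table from a sequenced read amounts to locating the flanking markers and cutting the enclosed region into codes. The sketch below illustrates that idea only. The marker sequences, the fixed code length, and the function name `recover_code_tables` are hypothetical, and the sketch assumes the markers never occur inside a table.

```python
def recover_code_tables(read, left_marker, right_marker, code_len):
    """Return every marker-flanked table in a read as a list of codes."""
    tables, pos = [], 0
    while True:
        start = read.find(left_marker, pos)
        if start == -1:
            break
        body_start = start + len(left_marker)
        end = read.find(right_marker, body_start)
        if end == -1:
            break  # unmatched left marker: stop rather than guess
        body = read[body_start:end]
        if len(body) % code_len:
            raise ValueError("table length is not a multiple of the code length")
        tables.append([body[i:i + code_len] for i in range(0, len(body), code_len)])
        pos = end + len(right_marker)
    return tables


# Hypothetical markers and toy tables, chosen only for this example.
LEFT, RIGHT = "GGATCCGG", "CCGAATTC"
table_1 = ["ACGT", "TTGA", "CAGT"]
table_2 = ["GATA", "TCTC"]
read = ("AAAA" + LEFT + "".join(table_1) + RIGHT
        + "TTTT" + LEFT + "".join(table_2) + RIGHT + "AA")
recovered = recover_code_tables(read, LEFT, RIGHT, code_len=4)
```

In practice, sequencing errors inside a marker would need tolerant matching, which this sketch does not attempt.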
FIGURE 2. Application and performance evaluation of the ICBP‐assisted in vivo data storage. (a) Schematic of information storage and retrieval. An image was encrypted using the ICBP encryption algorithm and encoded by the codec system to produce information‐encoded DNA sequences. The dynamically generated code tables were synthesized into DNA sequences. All DNA sequences were inserted into plasmids and transformed into E. coli for serial culture. The original image was recovered by sequencing the stored DNA, followed by the reverse process of the ICBP encryption method. Created with Biorender.com. (b) Accuracy of data retrieved from E. coli from various generations. (c) Identification of information‐inserted plasmids after restriction endonuclease digestion. (d) PCR verification of the information‐encoded sequences. (e) Colony formation analysis of E. coli transformed with information‐inserted plasmids. (f) Identification of information‐inserted plasmids after serial passaging from 0 to 100 generations. (g) PCR verification of information‐encoded sequences after serial passage from 0 to 100 generations. (h) Information retrieval by Sanger sequencing.
2.2 Experimental Validation of ICBP‐Assisted In Vivo DNA Data Storage
A key advantage of our method is its ability to encode encrypted data directly into DNA sequences that can be synthesized and stored in microbial cells, thereby adding a further layer of DNA encryption. To validate this concept, an image, Minecraft, was encrypted by the proposed algorithm using two combinatorial code tables (Figure 2a). The encrypted image then underwent a DNA encoding step using “Wukong” [21] (Supplementary Note S2.1.4), which converted it into 31 DNA strands (Table S17). The strands were inserted into plasmids for enhanced stability, sequencing availability, transformation, and passaging (Figures S12–S15). As synthetic DNA can be embedded into any organism, we stored the codes in synthetic DNA as a proof of concept. Code‐containing sequences were inserted into plasmids with markers at both ends of the utilized code tables to ensure that they could be readily retrieved from the synthetic genes for data decryption (Figure S12). Both information‐ and code‐table‐encoded plasmids were transferred into Escherichia coli (E. coli) cells for storage. Stability, bulk replication, and information recovery were determined using standard biological processes. The plasmids and encoded sequences were successfully amplified and size‐verified by gel electrophoresis (Figure 2b,c; Figure S14), demonstrating reliable extraction and replication of plasmids and information‐encoded sequences from E. coli cells. To further assess the robustness of the storage system, three randomly selected strains with different information sequences were serially passaged. As shown in Figure 2d, the cells grew well after passaging, indicating their capacity for bulk replication. Plasmids were harvested at 0, 10, 20, 40, 60, 80, and 100 generations and evaluated by gel electrophoresis (Figure 2e) and PCR amplification (Figure 2f), indicating long‐term stability and consistent, reliable information replication.
Sanger sequencing of plasmids from each generation revealed complete alignment with the originally designed sequences (Figures S16–S18); partial chromatograms are presented in Figure 2g. The encoded data were retrieved with 100% accuracy after generation passaging (Figure 2g), validating the reliability and effectiveness of the bio‐encryption of the data by ICBP. These results demonstrate that ICBP can be paired with DNA‐based data storage to enhance data security and point to approaches for its practical use.
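The codec constraints mentioned for the DNA encoding step (sequence length, GC content, and homopolymer length) are straightforward to screen for before synthesis. The following sketch shows one way such a check could look. The thresholds are placeholder assumptions and are not the values used by the Wukong codec or by the authors.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def passes_constraints(seq, gc_range=(0.4, 0.6), max_run=4, max_len=200):
    """Check length, GC content, and homopolymer limits (assumed thresholds)."""
    low, high = gc_range
    return (
        0 < len(seq) <= max_len
        and low <= gc_content(seq) <= high
        and longest_homopolymer(seq) <= max_run
    )


balanced = passes_constraints("ACGTACGTAC")      # GC 0.5, longest run 1
long_run = passes_constraints("AAAAAACGCG")      # run of six A bases
```

A strand that fails the screen would normally be re-encoded with a different codec rule rather than edited by hand, since editing would change the stored data.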
2.3 Performance Evaluation of the Encryption Potential
As ICBP integrates the Rossler system, dynamic code tables, Logistic map, Sine map, PWLCM map, and XOR strategies to encrypt data, we evaluated its encryption ability using in silico simulations (Supplementary Note S4). Six images containing diverse information were encrypted (Figure S19), and no discernible relationship was observed between the original and encrypted images, demonstrating the effectiveness of the algorithm. In addition, red, green, and blue (R, G, B) histograms of the original and encrypted images were analyzed by plotting the grayscale frequency distributions, which revealed significant differences between the plain and ciphered images (Figures S20–S21). The pixel distribution in these three channels was simulated before and after encryption and further processed to compare their uniformity. As shown in Figure 3a, the pixels were distributed more uniformly in each channel for the encrypted images than for the corresponding original images. In addition, the correlation coefficients in the horizontal, vertical, and diagonal directions were simulated to evaluate the functionality of the developed algorithm (Figures S22–S23), with the numerical coefficients illustrated in Figure 3b and summarized in Table S13. These results demonstrate that the correlation coefficients in all directions approached 1 for the original images and 0 for the encrypted images, indicating strong linear relationships between adjacent pixels in the plain images but no detectable relationship in the ciphered images. Therefore, the proposed algorithm offers good functionality for data encryption.
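The adjacent‐pixel correlation test described above can be illustrated with a short sketch. It assumes the usual Pearson coefficient computed over every pair of neighboring pixels in one channel. The paper may instead sample a subset of pairs, and the two toy channels below are synthetic stand-ins rather than the study's images.

```python
import math
import random

OFFSETS = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def adjacent_correlation(channel, direction):
    """Correlate each pixel with its neighbor in the given direction."""
    dy, dx = OFFSETS[direction]
    h, w = len(channel), len(channel[0])
    xs = [channel[y][x] for y in range(h - dy) for x in range(w - dx)]
    ys = [channel[y + dy][x + dx] for y in range(h - dy) for x in range(w - dx)]
    return pearson(xs, ys)


# Synthetic channels: a smooth gradient (plain-like) and uniform noise
# (cipher-like). Neither is one of the images used in the study.
rng = random.Random(0)
smooth = [[x + y for x in range(64)] for y in range(64)]
noisy = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
plain_like = {d: adjacent_correlation(smooth, d) for d in OFFSETS}
cipher_like = {d: adjacent_correlation(noisy, d) for d in OFFSETS}
```

The gradient gives coefficients close to 1 in every direction and the noise gives coefficients close to 0, which is the qualitative pattern reported for plain and encrypted images.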
FIGURE 3. Characteristics of the encryption algorithm. (a) Pixel distributions of six distinct images in R, G, and B channels before and after encryption. The inset in each subplot shows the plain image used for algorithm performance evaluation. (b) Correlation coefficient evaluation of the original and encrypted images in horizontal, vertical, and diagonal directions, respectively. Images labeled Image_0 through Image_5 correspond to the respective insets presented in Figure 3a and are named “Chest”, “Cloud”, “Orange”, “Lung”, “Building”, and “Pepper”. “Chest” and “Lung” images are accessible at https://github.com/linhandev/dataset?tab=readme-ov-file; “Pepper” is accessible in the USC‐SIPI image database (http://sipi.usc.edu/database/); “Orange” is accessible in the MS‐COCO dataset (https://visualqa.org/); “Cloud” and “Building” were photographed by the authors.
The security of the algorithm was further validated by simulating its randomness, information entropy (H), and gray value difference (GVD) (Table 1). After encryption, the information entropy approached its maximum of 8 [22, 23] and the GVD approached its ideal value of 1 [24], confirming a high level of encryption complexity. In addition, the algorithm offers superior encryption quality (EQ) and strong resistance to differential attacks, as measured by the number of pixel change rate (NPCR) and unified average changing intensity (UACI), demonstrating its high security and robustness to various attack modalities (Table 1 and Table S14). Taking “Orange” as an example, robustness against various types and degrees of noise attack was simulated, with the peak signal‐to‐noise ratio (PSNR) decreasing gently as the attack intensified (Figure 4a–c). Despite the decrease, the PSNR across all R, G, and B channels remained above 5 even when 60% of the image was cropped or salt‐and‐pepper noise of 0.6 was applied, and above 6.5 under Gaussian noise. These results are consistent with the simulated recovery of images subjected to specific noise (Figures S24–S26), confirming the high robustness of the developed encryption algorithm.
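Several of the metrics named above have standard textbook definitions, sketched here for a single 8‐bit channel given as a flat list of pixel values. This assumes the paper follows the usual formulas (Shannon entropy in bits, NPCR as the fraction of differing pixels, UACI as the mean absolute difference divided by 255, and PSNR from the mean squared error). GVD and EQ are left out because their exact definitions are not given in this text.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy in bits; the maximum for 8-bit data is 8."""
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in Counter(pixels).values())


def npcr(c1, c2):
    """Fraction of positions where two cipher images differ."""
    return sum(a != b for a, b in zip(c1, c2)) / len(c1)


def uaci(c1, c2):
    """Mean absolute intensity difference, normalized by 255."""
    return sum(abs(a - b) for a, b in zip(c1, c2)) / (255 * len(c1))


def psnr(reference, test):
    """Peak signal-to-noise ratio in dB for 8-bit data."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(255 ** 2 / mse)


uniform = list(range(256))        # every gray level exactly once
h_uniform = entropy(uniform)      # reaches the 8-bit maximum
h_flat = entropy([7] * 10)        # a constant image carries no information
```

NPCR and UACI are computed between two cipher images whose plain images differ in a single pixel, so they measure how far one small change spreads through the output.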
FIGURE 4. Robustness evaluation for the ICBP encryption algorithm. Robustness to (a) crop attack (10%–60%), (b) Gaussian noise (0.1–0.6), and (c) salt and pepper noise (0.1–0.6).
In data encryption and storage systems, key space refers to the total number of potential cryptographic keys that can be utilized for encrypting, decrypting, or accessing data encoded in synthetic DNA sequences [25]. Key space is a critical parameter for evaluating the security ensured by an encryption algorithm because a small key space makes the encryption vulnerable to brute‐force attacks. Therefore, after analyzing the aforementioned general characteristics, we calculated the key space to compare the encryption potential of ICBP with that of existing encryption methods. Following the calculation process described in the Methods section, ICBP exhibited a vastly larger dynamic code table size (several hundred orders of magnitude greater) and an exponentially greater key space (over 100 orders of magnitude) than existing methods. As shown in Table S15, the key space of ICBP encryption exceeded that of the pseudo‐random‐number‐generator‐based method developed by Yang et al. [26] by over 100 orders of magnitude and surpassed that of other methods by 200 to 400 orders of magnitude [27, 28, 29, 30, 31, 32]. In addition, the table size in the ICBP method was determined to be 184^184^, which is significantly larger than that in the existing methods (Table S15). A larger code table size and key space lead to an exponentially improved potential for data security because more attempts are required for brute force attacks [25, 33]. Furthermore, Table S15 compares the essential parameters of the ICBP encryption algorithm with those of both conventional and recently developed approaches. While most existing algorithms primarily target image data and focus on enhancing security within that specific domain [26, 27, 28, 29, 30, 32, 34, 35, 36], their applicability to other file types remains limited. In contrast, the ICBP encryption algorithm demonstrates broad compatibility across multiple formats.
Overall, ICBP offers increased key space compared with previous approaches, while preserving high flexibility for diverse digital content (Supplementary Note S5 and Figures S27–S30).
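To give a sense of scale for the reported table size of 184 raised to the power 184, the short calculation below converts it to orders of magnitude and to a rough exhaustive-search time. The guessing rate of 10^12 attempts per second is an arbitrary assumption chosen only for illustration, and the calculation is not the key space derivation from the Methods section.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def orders_of_magnitude(base, exponent):
    """log10 of base**exponent, computed without building the huge number."""
    return exponent * math.log10(base)


def log10_years_to_exhaust(log10_size, guesses_per_second=1e12):
    """log10 of the years needed to try every entry at an assumed rate."""
    return log10_size - math.log10(guesses_per_second) - math.log10(SECONDS_PER_YEAR)


table_orders = orders_of_magnitude(184, 184)   # between 416 and 417
digits = len(str(184 ** 184))                  # exact digit count via big integers
years_log10 = log10_years_to_exhaust(table_orders)
```

Working in logarithms keeps the arithmetic in ordinary floating point, while the digit count from Python's arbitrary-precision integers serves as an independent check on the same magnitude.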
2.4 Impact of Combinatorial Code Table Mapping and XOR Computing on the Performance of ICBP Encryption Algorithm
As shown in Figure 1, the ICBP encryption algorithm (Figure 1bi–vi) integrates chaotic operations with combinatorial code table mapping and XOR computing to achieve high security, unlike traditional encryption algorithms that rely on multiple chaotic or diffusion operations alone. Combinatorial code tables, which map pixels or binary data into DNA nucleotides [37, 38], and XOR operations offer synergistic data security over traditional cryptographic methods because they provide structural and algorithmic security, respectively [39, 40]. To evaluate the effect of combinatorial code table mapping and XOR computing on the encryption performance, we conducted a comparative analysis between ICBP encryption (Figure 1bi–vi) and its computational‐only counterpart (i.e., the steps displayed in Figure 1bi, ii, iv, and vi, excluding DNA‐based operations). As shown in Figure 5, in silico simulations were performed to assess four key metrics, NPCR, UACI, GVD, and EQ, for both versions. Although the pixel change rate approached 1 in both cases, encryption without DNA‐based operations (combinatorial code table mapping and XOR) exhibited relatively lower uniformity and a slightly reduced average NPCR compared with the full encryption system (Figure 5a). Similarly, Figure 5b indicates that the UACI values across different color channels were higher and more uniform when combinatorial code table mapping and XOR computing were included, indicating improved diffusion with the full encryption process. For the GVD, both methods yielded comparable distributions across the color channels. Finally, the EQ values in the R, G, and B channels were consistently higher for ICBP encryption, confirming that integrating combinatorial code table mapping and XOR computing into the encryption pipeline enhanced performance and reinforcing the advantage of ICBP over purely computational encryption methods.
FIGURE 5. Effect of combinatorial code table mapping and XOR computing on ICBP encryption. Performance comparison between ICBP and its purely computational counterpart on (a) number of pixel change rate, (b) unified average changing intensity, (c) gray value difference, and (d) encryption quality. “Encrypt” denotes encryption performed using the complete ICBP encryption algorithm (steps i–vi in Figure 1b). “Encrypt No DNA” represents the same algorithm executed without the DNA‐specific operations, namely (iii) combinatorial code table mapping and (v) XOR in Figure 1b.
“NM” and “NC” refer to “not mentioned” and “not considered” for the in vivo DNA data storage; “Wukong” used a combination of codec rules for encryption, which gives a limited encryption potential; *: “This work” utilizes ICBP to incorporate both encryption algorithms and the “Wukong” codec, and further uses “Dynamic Code tables” for encryption, giving it an expanded encryption capacity. ^a^: Sanger sequencing is a standard sequencing technique that typically does not involve “coverage” in the same way as NGS. Therefore, the publications used a single round of Sanger sequencing to read the information stored in vivo, with no detected error (NDE) for data decoding. ^b^: Sequencing coverages of 13.6×, 16.8×, 20.4×, 22.6×, 25.4×, and 27.9× were analyzed, and a minimum of 16.8× was enough for data recovery. ^c^: Error rates for the whole genome, encoding region, and subsample of non‐encoding region are 11.24%, 9.29%, and 11.53%, respectively. ^d^: An average coverage of 603× was used to obtain sufficient data for analyzing nanopore error patterns and error correction capacity.
3 Discussion
Over the past decades, DNA has emerged as a highly promising medium to meet rapidly increasing data volume requirements owing to its high storage density, long lifespan, and low energy consumption [3, 4, 5]. A standard DNA data storage workflow includes encoding, synthesis, storage, retrieval, sequencing, and decoding. While extensive efforts have focused on in vitro DNA data storage, in vivo strategies are attracting increasing attention [9]. Unlike in vitro methods, which typically rely on PCR‐based retrieval and are limited by degradation and lack of scalability, in vivo storage offers self‐replication and natural protection within living organisms, facilitating sustainable and easily retrievable storage solutions [7]. However, although microbial propagation enables rapid and autonomous information replication, it also raises significant security concerns, as unintended environmental release may compromise the confidentiality and integrity of the stored data [9, 48]. These concerns are particularly relevant in a data‐centric society where personal, medical, or IP information may be stored biologically. Therefore, robust encryption mechanisms are essential to protect the confidentiality, integrity, and ownership of in vivo DNA‐stored data.
This paper presents ICBP, a novel cross‐disciplinary encryption framework that bridges computational logic with microbial biology for in vivo data storage and protection. By integrating chaotic maps from computer science with code libraries derived from gene regulatory networks or complete microbial genomes, ICBP enables code table generation via genomic dynamics. Because these code tables are dynamically transformed and randomly combined, using them to convert pixels or binary digits into DNA strands yields variable mappings with enhanced data security. The use of dynamic code tables not only expands the key space by over 100 orders of magnitude, but also significantly enhances encryption strength against brute force and statistical attacks, thus addressing key vulnerabilities in conventional DNA‐based encryption approaches. In addition, its general performance was systematically assessed via simulations of histograms, correlation coefficients, and randomness, which are generally used to characterize newly developed encryption algorithms, thereby confirming its good functionality in encryption.
One of the central innovations of ICBP lies in its ability to exploit the digital nature of DNA sequences while maintaining compatibility with biological hosts. Most DNA encryption systems rely on static, publicly known, and operational dynamic coding systems, which render the encrypted content vulnerable once the coding scheme is compromised. In contrast, ICBP dynamically generates code tables from microbial systems, which are characterized by their inherent diversity and evolutionary adaptability, and stores them in synthetic DNA, gene regulatory networks, or entire genomes. This design introduces intrinsic unpredictability and complexity into the encryption and decryption processes, thereby significantly enhancing security. Another notable advantage of ICBP is its compatibility with various data formats, including images, which are often overlooked by text‐focused DNA encryption algorithms (Supplementary Note S5). By adopting a chaos‐enhanced encryption mechanism, ICBP ensures the diffusion and confusion properties necessary for secure image encryption, demonstrating robustness across multiple file types. When ICBP is applied to other file types, the original information is first converted into binary digits and divided into three groups, followed by the same encoding and chaotic operations, as illustrated in Figure 1b iii‐ix.
To illustrate the practical utility of ICBP for in vivo DNA data storage, we proposed a potential application for encrypting and storing an image in E. coli cells to offer a secure and scalable method for ownership and IP protection (Supplementary Note S3). This ability to embed encrypted identifiers within living strains opens up new possibilities for strain authentication, biobanking, and microbial forensics. Reading the original image relies on recovering the code tables by sequencing the synthetic DNA with flanking markers, followed by reverse encryption. A comparison with recent advances in in vivo DNA data storage is provided in Table 2, outlining key features such as biological and computational operations, codec methods, and encryption potential. While prior efforts primarily emphasized data embedding or retrieval strategies, our approach uniquely integrates a robust sandwich‐like encryption architecture within the storage scheme, highlighting the critical role of security in biologically embedded data systems. In addition, the computer‐biology‐computer sandwich‐style encryption produces a vast key space, which provides a significantly expanded encryption capacity to withstand brute‐force attacks compared with other in vivo storage techniques.
Our approach extends beyond individual species, enabling parallel data encryption using multiple types of organisms and programming across multiple biological systems and computers (Supplementary Note S2). Further extensions of this work could incorporate additional programming steps within biological systems, culminating in a fully integrated biological–computational encryption algorithm [49]. Moreover, the designed ICBP encryption system should also support secure data transmission (Figure S31 and Supplementary Note S6). This study overcomes the limitations of traditional computer‐level programming by harnessing the unique properties of both synthesized and genomic DNA. The computer–microbe hybrid approach represents a paradigm shift, offering a novel and inherently secure way to safeguard digital inheritance in the face of ever‐evolving cyber threats.
Notably, the biological risks of in vivo DNA storage require considerable attention. Occasionally, horizontal gene transfer and the release of genetically modified microorganisms into the environment can cause biosafety problems [50]. Therefore, special containers or devices should be used to store this information for highly secure preservation. Moreover, advances in plasmid engineering and genome editing should be incorporated to mitigate these concerns in the future [51]. Future strategies can incorporate self‐destruction circuits, for example, using CRISPR–Cas systems or restriction enzymes [9]. It is therefore feasible to destroy in vivo information rapidly and deliberately if unintended biological events arise. Beyond biosafety, data integrity is also threatened by mutation accumulation and plasmid loss over time [52]. To counteract this, data redundancy can be achieved by storing multiple copies in diverse microbial hosts or isolated locations. Furthermore, routine quality control is essential for long‐term information maintenance. This should include periodic sequencing of the plasmids after multiple passages and screening of colonies to verify data integrity. As a corrective measure, the targeted resynthesis of any corrupted or lost plasmids can replenish the information repository, ensuring data longevity.
Overall, the concept of integrating computational programming with microbial genomes is more than an academic pursuit: it represents a transformative solution to the pressing challenges of data security. Unlike conventional DNA‐based or computer‐centric encryption methods, our approach leverages the genomic dynamics of bio‐organisms and computational innovation to build a sophisticated multilayered defense for sensitive information, providing a secure, scalable framework for in vivo DNA data storage.
Experimental Section
4
Key Space Evaluation for the ICBP Encryption Algorithm
4.1
The developed encryption algorithm comprises two main components: dynamic code tables and chaotic mapping. The dynamic tables consist of seven code tables generated by programming gene regulation networks or microbe genomes, containing 2^1^, 2^2^, and so on up to 2^7^ codewords, respectively (Tables S6–S12). Because the two codewords in Tables S6 and S7 are identical and the 48 codewords in Tables S11 and S12 are identical, the total number of unique codewords (*C_n_*) is reduced, as calculated using Equation (1).
The total number of codewords used in the coding process was N = 256, which were mapped to a set of 256 values. The number of potential combinations (P) of these 256 codewords was calculated using Equation (2).
However, because brute force attacks are indiscriminate, they overlook redundant or duplicate codewords. Therefore, duplicate entries among the 256 codewords must be excluded in advance, which leaves N = 184 unique codewords actually in use. Accordingly, the total number of codeword combinations in practical scenarios was:
Subsequently, the approximate key of the dynamic code table (*K_t_*) is calculated using Equation (4).
In addition to the dynamic code tables, the proposed algorithm uses multiple chaotic maps: the Rossler chaos system and the Logistic, Sine, and PWLCM maps.
The function equations of the Rossler chaos system [53] are as follows:
where x, y, and z are the state variables of the 3D (R, G, B channel) chaotic system, and j is the iteration of the dynamic system. The parameters α, β, ω, and λ, respectively, influence the amplitude and stability of the oscillations, determine the oscillation frequency of the system, introduce nonlinear perturbation and govern the degree of chaos, and adjust the equilibrium position of the attractor. To ensure the randomness of the chaotic system, the parameters are typically constrained as follows: 0.1 ≤ α ≤ 0.4, 0.1 ≤ β ≤ 0.3, 4 ≤ λ ≤ 10, ω = 1. Furthermore, to ensure sufficient accuracy in scientific computations, a maximum of six decimal places is usually retained, corresponding to an accuracy of 10^−6^. Therefore, the numbers of possible values for α, β, and λ are 4 × 10^5^, 2 × 10^5^, and 6 × 10^6^, respectively. The key space for this step is *K_f1_* = 4.8 × 10^17^.
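The Rossler equations themselves are not reproduced in this text, so the following is only a minimal sketch of the textbook Rossler system with its usual parameters `a`, `b`, and `c`, integrated with a simple forward Euler step. How `a`, `b`, and `c` correspond to α, β, ω, and λ above is an assumption left open here, and the function name, step size, and initial state are illustrative choices.

```python
def rossler_trajectory(a, b, c, state=(0.1, 0.1, 0.1), dt=0.01, steps=1000):
    """Integrate the textbook Rossler system with forward Euler steps.

    dx/dt = -y - z
    dy/dt = x + a*y
    dz/dt = b + z*(x - c)

    Hypothetical helper: the parameter names follow the textbook form,
    not necessarily the parameterization used in the paper.
    """
    x, y, z = state
    trajectory = []
    for _ in range(steps):
        dx = -y - z
        dy = x + a * y
        dz = b + z * (x - c)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        trajectory.append((x, y, z))
    return trajectory


# Classic chaotic parameter choice for the textbook system.
points = rossler_trajectory(a=0.2, b=0.2, c=5.7)
```

Each of the three state variables could then drive one color channel, which is one plausible reading of the "3D (R, G, B channel)" description.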
After combinatorial code table mapping, the Logistic and Sine maps are applied to the row and column sequences. In the following calculations, S denotes the order of nucleotides in the DNA sequences. The Logistic map generation formula [54] is as follows:
where *S_0_* is the initial value of the chaotic system, which behaves chaotically (its evolution is unpredictable) for 0 < *S_0_* < 1 with *S_0_* ∉ {0, 0.25, 0.5, 0.75, 1.0}. *S_t_* and *S_t+1_* are the chaotic variable at iteration t and the next state generated by the Logistic map, respectively. μ is the control parameter that determines the system's chaotic behavior; typically 3.57 ≤ μ ≤ 4. The numbers of possible values of μ and *S_0_* are 4.3 × 10^5^ and 10^6^, respectively. The key space for this step is *K_f2_* = 4.3 × 10^11^.
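As an illustration of how such a map can act as a key, the sketch below iterates the standard Logistic recurrence S_t+1 = μ·S_t·(1 − S_t) and turns the resulting values into a permutation by ranking them. The ranking step is a common scrambling technique and is an assumption here, not necessarily the procedure used in ICBP; both function names are hypothetical.

```python
def logistic_sequence(s0, mu, n):
    """Return n successive states of the Logistic map starting from s0."""
    values = []
    s = s0
    for _ in range(n):
        s = mu * s * (1.0 - s)
        values.append(s)
    return values


def permutation_from_chaos(values):
    """Rank the chaotic values to obtain a permutation of 0..len(values)-1."""
    return sorted(range(len(values)), key=lambda i: values[i])


# The pair (s0, mu) plays the role of the key: a different pair
# generally yields a different permutation of the row or column indices.
sequence = logistic_sequence(s0=0.3, mu=3.99, n=8)
row_order = permutation_from_chaos(sequence)
```

Because the permutation depends on both `s0` and `mu`, an attacker has to guess both values, which is the source of the 4.3 × 10^5^ × 10^6^ count quoted above.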
The function equation of Sine mapping [55] is as follows:
where the initial value *S_0_* lies in the range 0 ≤ *S_0_* ≤ 1. *S_n_* and *S_n+1_* are the current and next state variables of the Sine map, respectively. a is the control parameter and ε = 4/a, where ε is the mapping coefficient that controls the mapping strength. The system is in a chaotic state when ε lies within the intervals (3.48, 3.72) and (3.8, 4). Therefore, the number of possible values of *S_0_* is 10^6^, and that of ε is 4.4 × 10^5^. The key space for this step is *K_f3_* = 4.4 × 10^11^.
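The Sine map equation is not reproduced in this text. A minimal sketch, assuming the commonly used form S_n+1 = (ε/4)·sin(π·S_n) with ε as the mapping coefficient, is given below. The exact form and scaling used in the paper may differ, and the function name is hypothetical.

```python
import math


def sine_sequence(s0, eps, n):
    """Return n successive states of the Sine map S -> (eps/4) * sin(pi * S).

    Assumed form: eps is taken to be the mapping coefficient described
    in the text, with chaotic behavior expected for eps close to 4.
    """
    values = []
    s = s0
    for _ in range(n):
        s = (eps / 4.0) * math.sin(math.pi * s)
        values.append(s)
    return values


# With 0 <= s0 <= 1 and 0 < eps <= 4, every state stays inside [0, 1].
states = sine_sequence(s0=0.3, eps=3.9, n=10)
```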
The function equation of PWLCM mapping [56] is as follows:
In the PWLCM chaotic system, *S_m_* and *S_m+1_* are the chaotic variables at iterations m and m+1, respectively. The control parameter p directly affects the segment position of the mapping and the slope of the mapping function. When 0 < p < 1 and the initial value satisfies 0 ≤ *S_0_* ≤ 1, the system remains in a chaotic state. Therefore, the number of possible values of *S_0_* is 10^6^, and that of p is 10^6^. The key space for this step is *K_f4_* = 10^12^. In this algorithm, when the parameters of the chaotic system are used as keys for encryption, the total chaotic system key is:
Overall, the key space of the encryption algorithm is determined by both the dynamic code table and chaotic system, resulting in a total key space of:
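As a rough check of the arithmetic in this subsection, the per-map counts quoted above can be multiplied together. The sketch below assumes the four chaotic keys are chosen independently, so that their key spaces multiply; it uses the counts exactly as stated in the text rather than re-deriving them, and it does not include the code table contribution *K_t_*, whose equation is not reproduced here. The constant and function names are illustrative.

```python
import math

# Per-step key counts as quoted in the text (not re-derived here).
K_ROSSLER = 48 * 10**16   # 4.8 x 10^17
K_LOGISTIC = 43 * 10**10  # 4.3 x 10^11
K_SINE = 44 * 10**10      # 4.4 x 10^11
K_PWLCM = 10**12          # 10^12


def combined_key_space(counts):
    """Multiply independent key counts using exact integer arithmetic."""
    total = 1
    for count in counts:
        total *= count
    return total


chaotic_total = combined_key_space([K_ROSSLER, K_LOGISTIC, K_SINE, K_PWLCM])
chaotic_bits = math.log2(chaotic_total)  # equivalent key length in bits
```

Under these assumptions the chaotic part alone comes to about 9.08 × 10^52^, or roughly 176 bits, before the dynamic code table key is multiplied in.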
Embedding Information Sequences and Code Tables Into Synthetic DNA and Genes
4.2
Each of the 33 DNA strands was inserted into a plasmid for higher stability (Table S17). An amount of 400 ng of plasmids constructed by Tsingke Biotech was added to 100 µL of BL21(DE3) competent cells (TransGen Biotech, CD601‐02) thawed on ice, and mixed gently. After standing on ice for 5 min, the tubes were incubated at 42 °C in a metal bath (Yeasen, ES‐MB1) for 60 s and immediately transferred to an ice bath for 2 min. Subsequently, 700 µL of LB culture medium was added to each tube and mixed thoroughly. An amount of 150 µL of the resulting mixture was plated evenly on LB Carb solid medium and incubated at 37 °C (Bluepard Instruments, LRH‐250) overnight to generate colonies (Figure S13). The plasmids were extracted from the cells to confirm the successful insertion of the information sequence (Figure S14). The codewords in the seven codon tables were aligned to form a single‐stranded DNA (ssDNA) sequence, with markers (ACATCTGGGGTCTACG) inserted at both ends of the two codon tables used for image encryption, resulting in two 968‐nt DNA sequences. These two codon tables were randomly selected in silico. The designed DNA sequences were synthesized by standard phosphoramidite chemistry followed by enzyme‐assisted ligation, both performed by Tsingke Biotech. The DNA sequences were inserted into plasmids and then transformed into BL21(DE3) competent cells for higher stability. Finally, the bacterial strains shown in Figure S12 were obtained.
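Recovering a marked region from a sequencing read can be sketched as a simple string search. The example below uses the marker sequence given above and returns every stretch of a read that lies between two consecutive marker copies. The layout assumed here (marker, payload, marker on the same strand, with no sequencing errors inside the markers) is a simplification, and the function name and toy read are hypothetical.

```python
MARKER = "ACATCTGGGGTCTACG"


def extract_between_markers(read, marker=MARKER):
    """Return the segments found between consecutive marker occurrences."""
    positions = []
    start = read.find(marker)
    while start != -1:
        positions.append(start)
        # Continue searching after the current marker copy.
        start = read.find(marker, start + len(marker))

    segments = []
    for left, right in zip(positions, positions[1:]):
        segments.append(read[left + len(marker):right])
    return segments


# Toy read: flanking sequence, marker, payload, marker, flanking sequence.
toy_read = "GGTT" + MARKER + "ACGTACGT" + MARKER + "CCAA"
payloads = extract_between_markers(toy_read)
```

If a read carries more than two marker copies, the function returns every inter-marker segment, and deciding which of them are code tables depends on the actual construct design. A practical pipeline would also need to tolerate mismatches and check the reverse complement, which this sketch does not attempt.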
Cell Passage
4.3
The growth curves were obtained by measuring OD600 values with a cell density meter (Biochrom, Ultrospec 10) for approximately 8 h at 15‐min intervals (Figure S15). After the doubling time within the logarithmic growth period was calculated, cell passaging was performed to evaluate the stability of data access in the cell‐based storage architecture. The as‐received glycerol stocks of bacteria were regarded as generation 0, and a specific amount of the stock was added to LB amp culture medium to reach an OD600 of 0.05. Cells were cultivated at 37 °C and 200 rpm (Shanghai Minquan Instrument, LRH‐250), and the resulting culture was diluted into fresh LB amp medium when the OD600 approached approximately 0.8. Cultivation and dilution were repeated until the cells had been passaged for 100 generations.
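The number of generations per passage can be estimated from the OD600 values alone. The sketch below assumes OD600 is proportional to cell number over this range and that each passage starts again at OD600 = 0.05, which the text does not state explicitly; the doubling-time route described above is an alternative. Function names are illustrative.

```python
import math


def generations_per_passage(od_start, od_end):
    """Number of doublings needed to grow from od_start to od_end."""
    return math.log2(od_end / od_start)


def passages_needed(target_generations, od_start, od_end):
    """Smallest number of passages that reaches the target generation count."""
    per_passage = generations_per_passage(od_start, od_end)
    # The small tolerance guards against floating-point rounding in log2.
    return math.ceil(target_generations / per_passage - 1e-9)


per_passage = generations_per_passage(0.05, 0.8)  # 16-fold growth
n_passages = passages_needed(100, 0.05, 0.8)
```

Under these assumptions each passage from OD600 0.05 to 0.8 is a 16-fold expansion, or about four generations, so roughly 25 passages are needed to reach 100 generations.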
Plasmid Collection and Sequence Verification
4.4
Plasmids were collected and purified using a HiPure Plasmid EF Mini Kit (P1112‐02) following the manufacturer's instructions. Plasmids were evaluated by restriction enzyme digestion. A 20‐µL reaction mix was prepared from 500 ng of plasmid, 2 µL of 10× NEBuffer, 0.5 µL of BamHI (New England Biolabs, R0136V), and nuclease‐free water. The digestion was completed by incubating at 37 °C for 40 min. Plasmids were analyzed by gel electrophoresis before and after digestion.
PCR was conducted to test and amplify the information sequences using the M13F‐77 (GATGTGCTGCAAGGCGATTA) and M13R‐88 (TTATGCTTCCGGCTCGTATG) primers on a Biometra TRIO 48 Multi Block thermal cycler (Analytik Jena). The reaction mix was prepared as follows: 10 ng of plasmid containing each information sequence was added to each tube and mixed thoroughly with 2 µL of 10× DreamTaq Buffer, 2 µL of dNTPs (0.2 mM of each), 1 µL of M13F‐77 and M13R‐88 primers (10 µM), and 0.1 µL of 5‐U/µL DreamTaq HotStart DNA polymerase. The mixture was brought up to 20 µL with nuclease‐free water and subjected to the following program: initial denaturation at 95 °C for 1 min; 35 cycles of denaturation at 95 °C for 30 s, annealing at 56 °C for 30 s, and extension at 72 °C for 30 s; and a final extension at 72 °C for 5 min. The plasmids were verified by 1% agarose gel electrophoresis and Sanger sequencing. The PCR products were also analyzed using 1% agarose gel electrophoresis.
Author Contributions
J.X. and Y.W. contributed equally to this work. J.X. performed the investigation and was responsible for methodology, validation, visualization, and writing the original draft as well as review and editing. Y.W. contributed to methodology, software development, validation, visualization, and conceptualization, and participated in review and editing. M.L. contributed to validation. Y.W. conducted the formal analysis. L.W. provided resources. H.M. performed formal analysis. H.Z. contributed to supervision and funding acquisition. J.D. provided supervision. S.C. contributed to supervision and funding acquisition. X.H. was responsible for conceptualization, supervision, funding acquisition, and writing—review and editing.
Funding
National Key Research and Development Program of China (2022YFF0710800, 2022YFF0710801, 2022YFF0710802); National Natural Science Foundation of China (32201207); Shenzhen Medical Academy of Research and Translation (C2302001; B2302041); Natural Science Foundation of Guangdong province (2024A1515012923); Shenzhen Science and Technology Program (KQTD20180413181837372, RCYX20221008092950122, JCYJ20250604142426036), Shenzhen Clinical Research Center for Respiratory Disease (LCYSSQ20220823091203007), Shenzhen Key Laboratory of Respiratory Diseases (SYSPG20241211173920041), Innovation Program of Chinese Academy of Agricultural Sciences and Shenzhen Outstanding Talents Training Fund, Major Project of Guangzhou National Laboratory (GZNL2024A02003).
Conflict of Interest
The authors declare no other conflicts of interest.
Supporting information
Supporting File: advs73305‐sup‐0001‐SuppMat.docx.
References
- 1. A. Wright, Worldwide IDC Global Data Sphere Forecast, 2024–2028: AI Everywhere, But Upsurge in Data Will Take Time, International Data Corporation, IDC Corporate, 140 Kendrick Street, Building B, Needham, MA 02494, 2024.
- 2. G. M. Church, Y. Gao, and S. Kosuri, “Next‐generation digital information storage in DNA,” Science 337 (2012): 1628. PMID: 22903519. DOI: 10.1126/science.1226355.
- 3. C. Zhang, R. Wu, F. Sun, et al., “Parallel molecular data storage by printing epigenetic bits on DNA,” Nature 634 (2024): 824–832. PMID: 39443776. DOI: 10.1038/s41586-024-08040-5. PMCID: PMC11499255.
- 4. K. Matange, J. M. Tuck, and A. J. Keung, “DNA stability: A central design consideration for DNA data storage systems,” Nature Communications 12 (2021): 1358. PMID: 33649304. DOI: 10.1038/s41467-021-21587-5. PMCID: PMC7921107.
- 5. C. K. Lim, S. Nirantar, W. S. Yew, and C. L. Poh, “Novel modalities in DNA data storage,” Trends in Biotechnology 39 (2021): 990–1003. PMID: 33455842. DOI: 10.1016/j.tibtech.2020.12.008.
- 6. N. Goldman, P. Bertone, S. Chen, et al., “Towards practical, high‐capacity, low‐maintenance information storage in synthesized DNA,” Nature 494 (2013): 77–80. PMID: 23354052. DOI: 10.1038/nature11875. PMCID: PMC3672958.
- 7. X. Huang, Z. Hou, W. Qiang, et al., “Towards next‐generation DNA encryption via an expanded genetic system,” National Science Review 12 (2024): nwae469. PMID: 40160677. DOI: 10.1093/nsr/nwae469. PMCID: PMC11951100.
- 8. Z. Hou, W. Qiang, X. Wang, et al., ““Cell Disk” DNA Storage System Capable of Random Reading and Rewriting,” Advanced Science 11 (2024): 2305921. PMID: 38332565. DOI: 10.1002/advs.202305921. PMCID: PMC11022697.
