ColiFormer: A Transformer-Based Codon Optimization Model Balancing Multiple Objectives for Enhanced E. coli Gene Expression
Saketh Baddam, Omar Emam, Abdelrahman Elfikky, Francesco Cavarretta, George Luka, Ibrahim Farag, Yasser Sanad

TL;DR
ColiFormer is a new codon optimization tool that improves gene expression in E. coli by balancing multiple biological factors.
Contribution
ColiFormer introduces a transformer-based framework that balances multiple codon optimization objectives using a novel mathematical approach.
Findings
ColiFormer improved CAI and tAI values while maintaining optimal GC content in silico.
The model reduced inhibitory cis-regulatory motifs compared to existing methods.
ColiFormer maintains competitive runtime performance while achieving these improvements.
Abstract
Codon optimization is widely used to improve heterologous gene expression in Escherichia coli. However, many existing methods focus primarily on maximizing the codon adaptation index (CAI) and neglect broader aspects of biological context. In this study, we present ColiFormer, a transformer-based codon optimization framework fine-tuned on 3676 high-expression E. coli genes curated from the NCBI database. Built on the CodonTransformer BigBird architecture, ColiFormer employs self-attention mechanisms and a mathematical optimization method (the augmented Lagrangian approach) to balance multiple biological objectives simultaneously, including CAI, GC content, tRNA adaptation index (tAI), RNA stability, and minimization of negative cis-regulatory elements. Based on in silico evaluations on 37,053 native E. coli genes and 80 recombinant protein targets commonly used in industrial studies,…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRNA and protein synthesis mechanisms · RNA modifications and cancer · Protein Structure and Dynamics
1. Introduction
Since the seminal breakthroughs of Chang and Cohen (1974) and Morrow et al. (1974), heterologous gene expression has become a cornerstone of modern biotechnology [1,2]. The expression of recombinant proteins in a heterologous host has a wide range of applications, including therapeutic protein production, vaccine manufacturing, diagnostic assays in biomedical fields, bioremediation in industrial fields, and food processing [3,4,5,6]. In particular, Escherichia coli is widely used for recombinant protein production; for example, it can be engineered to secrete heterologous proteins efficiently using its flagellar type III secretion system [7,8].
The recombinant protein therapeutics market reached 5.58 billion by 2030 [9]. However, codon bias can reduce expression yield and protein quality. In several heterologous systems, expression levels fall below 0.1 mg per 100 mL of culture [10]. The genetic code is redundant, with 61 sense codons encoding 20 amino acids. As a result, multiple synonymous codons can encode the same protein. Despite this redundancy, organisms preferentially use specific synonymous codons, a phenomenon known as codon bias. This bias arises due to differences in tRNA (Transfer Ribonucleic Acid) abundance and the efficiency of the translation machinery [11].
Rare codons, despite their potential to slow translation, play a crucial role in enhancing protein folding efficiency by introducing translational pauses that allow nascent polypeptides to correctly adopt their functional conformations. Conversely, abundant codons promote faster and more accurate translation rates. This can improve overall protein yield but sometimes at the cost of misfolding or aggregation due to insufficient translational folding time [12].
Codon harmonization solutions attempt to resolve this dilemma by recreating the native gene’s codon usage patterns in the heterologous host. Rather than just increasing the use of abundant codons, harmonization balances synonymous codon selection to improve translational kinetics. This allows accurate folding and functional protein expression [13]. Aligning with codon bias, the selection of synonymous codons that are more commonly used in the host genome can significantly enhance protein expression, with improvements ranging from 2- to 15-fold in E. coli [14].
However, despite their potential to slow translation, rare codons can enhance the folding efficiency of the protein, allowing proper protein expression and functional conformation [15]. Other studies observe that codon bias—while it increases expression—introduces unintended issues—cellular toxicity, tRNA imbalance, and metabolic stress—which ultimately hinder protein production, solubility, and stability [12,16]. This trade-off between rare and abundant codons makes it necessary to strategically use codon optimization for recombinant genes.
Codon optimization seeks to balance codon usage preferences with protein production goals to maximize yield without compromising the quality and functionality of the expressed protein. It involves strategically modifying codons to align with the host organism’s preferences while preserving the original amino acid sequence [17]. However, many existing algorithms focus predominantly on maximizing protein production and CAI (Codon Adaptation Index), neglecting the biological context and characteristics of the gene [12]. Furthermore, many methods do not balance multiple objectives or capture long-range sequence dependencies. This limitation can disrupt tRNA pools and lead to cellular toxicity, impaired protein folding, and reduced function [10,14].
As an alternative to existing approaches, we introduce ColiFormer, an E. coli-specific transformer-based codon optimizer that balances codon adaptation with learned biological context to optimize heterologous expression predictions in E. coli, built upon the BigBird CodonTransformer architecture [18]. Fine-tuned on 3676 high-CAI E. coli genes curated from NCBI (National Center for Biotechnology Information), ColiFormer leverages the power of self-attention and fine-tuning to adapt from general sequence understanding to E. coli-specific codon usage, accounting for the complexity of codon bias within the host cell. It enables simultaneous control of multiple objectives (e.g., CAI, GC content, tAI) through Lagrangian optimization, outperforming current AI tools such as NCRF (Neural Codon Random Forest), CodonTransformer, and CodonGPT—as demonstrated in later sections [17,18,19]. Based on comprehensive in silico benchmarks, it offers a promising AI-driven solution for industrial gene synthesis and therapeutic protein production. While experimental validation is required to confirm these predictions in vivo, current metrics suggest that ColiFormer significantly improves sequence parameters relative to existing methods.
Rather than explicitly retaining rare codons at predefined positions, ColiFormer approaches codon selection through data-driven context awareness. By training on high-expression genes, the model prioritizes codon abundance (high CAI) to maximize yield; however, its Transformer architecture simultaneously learns to avoid toxic local motifs and preserve RNA stability patterns inherent in the training data. This allows the model to achieve high adaptation indices while implicitly respecting biological constraints that simple maximization algorithms often violate.
To evaluate ColiFormer’s performance, we benchmarked it on 80 industry-relevant genes derived from past recombinant protein expression studies, alongside 37,053 E. coli genes. ColiFormer’s optimized outputs were compared against six established methods: the original (unoptimized) sequence, IDT, Extended Random Choice (ERC), CodonGPT, NCRF, and CodonTransformer. Optimization performance was quantified using multiple in silico metrics, including CAI, GC content (Guanine–Cytosine content), negative cis-regulatory elements, RNA stability, tAI (tRNA Adaptation Index), success rate, and algorithm run-time.
2. Materials and Methods
2.1. Database and Benchmark Set
With 349,154 E. coli genomes, each one containing ~4400 genes, the National Center for Biotechnology Information’s (NCBI) GenBank database was used to collect the E. coli gene sequences [20]. Then, CD-HIT-EST [21] was used to cluster highly similar sequences that shared more than 90% sequence identity, thereby reducing redundancy and resulting in 41,435 unique gene sequences listed in Supplementary File S1. A 90% sequence identity cutoff was used to reduce redundancy and prevent data leakage between training and evaluation sets, ensuring that the model learns generalizable patterns rather than memorizing highly similar sequences. Furthermore, to mitigate potential biases inherent in public databases, we implemented a rigorous filtration pipeline. This included the removal of sequences with internal stop codons and the verification of correct start and stop codon placement, as shown in Figure 1, effectively filtering out pseudogenes and incomplete entries. To focus on highly expressed genes, the remaining sequences were ranked based on their CAI, calculated using a standard reference set of highly expressed E. coli genes. Listed in Supplementary File S2, the top 3676 high-CAI sequences were selected as the final training database, reflecting the typical codon-usage patterns and native characteristics of highly expressed E. coli genes. The remaining 37,053 gene sequences were reserved to evaluate ColiFormer’s ability to optimize native E. coli sequences.
As a benchmark set, 80 DNA sequences were compiled from published datasets and studies focused on heterologous protein production in E. coli. These genes cover a broad range of functional categories to ensure the fine-tuned model’s generalizability. This benchmark set serves as a validation for the tool’s effectiveness in optimizing well-established recombinant targets. These 80 DNA sequences, along with their sources, are provided in Supplementary File S3.
2.2. CodonTransformer Model Architecture
CodonTransformer is a multispecies deep learning model initially trained on over 1 million DNA–protein pairs from 164 organisms across all domains of life [18]. It employs the Transformer architecture with advanced context-awareness, using a sequence representation method called STREAM (Shared Token Representation and Encoding with Aligned Multi-masking). STREAM encodes both codon and amino acid identity per token, enabling the model to capture organism-specific codon preferences while preserving protein-level constraints. CodonTransformer study shows that this representation generalizes well to organisms not included in the training dataset, due to the model’s reliance on the codon-usage patterns learned across diverse species.
The model’s base architecture includes 89.6 million parameters built upon the BigBird Transformer variant, comprising 12 hidden layers with 768 hidden dimensions, 12 attention heads, and a block sparse attention mechanism with a block size of 64. BigBird was selected for its efficiency in modeling long-range dependencies in long, extensive genetic sequences, which is essential for capturing codon context [22]. This architecture enables efficient processing of sequences up to 2048 tokens while maintaining computational tractability through sparse attention patterns. The model’s 2048-token maximum input length is not a constraint here, since E. coli genes are shorter than this threshold [23]. Figure 2 illustrates the ColiFormer architecture built upon CodonTransformer, highlighting token embeddings, frozen and trainable transformer blocks, and multi-objective sequence evaluation and generation methods.
A key innovation of CodonTransformer is its dual representation strategy, where each token simultaneously encodes amino acid identity and codon information. Its vocabulary consists of 90 distinct tokens, including 64 amino acid–codon combinations, 20 masked amino acid tokens (a_unk through y_unk), 3 stop tokens, and 5 special tokens ([UNK], [CLS], [SEP], [PAD], [MASK]) for sequence processing. This representation enables the model to simultaneously learn amino acid sequence constraints and organism-specific codon usage patterns through masked language modeling objectives.
2.3. Dataset Preparation and Curation
For E. coli-specific optimization, we curated a high-quality dataset from multiple sources, selecting 3676 genes with elevated CAI values for model fine-tuning. This dataset primarily includes E. coli genes with documented high expression characteristics, validated against an established reference database of high-expression genes. Each sequence underwent rigorous validation to ensure structural integrity, including verification of proper start codons (ATG, TTG, CTG, or GTG; present in their natural frequencies), appropriate stop codon termination (TAA, TAG, or TGA), absence of internal stop codons, sequence length divisibility by three, and exclusive composition of standard nucleotides (A, T, G, C). All sequences passed these validation checks, with none excluded.
The validated sequences were processed into the CodonTransformer input format, where protein sequences were represented as masked token sequences [24]. For instance, a protein sequence “MALW” would be initially represented as “M_UNK A_UNK L_UNK W_UNK”, with UNK indicating unknown codon assignments. During training, the model learns to predict the optimal codons at each position based on surrounding sequence context and E. coli usage patterns. During inference, these masked tokens are resolved into context-optimized codons using maximum probability decoding or beam search methods. All sequences were tagged with organism ID 51, corresponding to “Escherichia coli general” in the CodonTransformer organism registry. The data preprocessing pipeline, including raw sequence filtering, deduplication, feature extraction, and preparation of inputs for model training, is summarized in Figure 1.
2.4. Fine-Tuning Methodology with Enhanced GC Control
A critical innovation in ColiFormer is the implementation of an enhanced augmented Lagrangian method to achieve precise control of GC content during fine-tuning [25]. In E. coli, maintaining an optimal GC content of around 52% is crucial, as deviations can negatively impact gene expression efficiency and cellular fitness, as shown in Figure 2. Deviations from optimal GC content can disrupt mRNA secondary structure. They may also impair ribosomal binding and reduce transcript stability. Traditional codon optimization approaches often struggle to maintain appropriate GC balance while optimizing for other parameters, resulting in sequences with suboptimal expression.
To address this, we formulated the optimization problem as a constrained learning task where the model minimizes masked language modeling loss while enforcing GC content constraints. The total loss function is formulated as:
Here, denotes the masked language modeling loss. When constraint terms are weak, CAI increases, but GC content may drift from the target. λ represents the Lagrangian multiplier; it corrects directional bias by shifting probabilities toward lower-GC codons when the GC content exceeds the target, or toward higher-GC codons when it is below the target. GC is the predicted GC content calculated through differentiable operations, and μ equals 0.52 as the target GC content for E. coli. is the adaptive penalty coefficient that controls the strictness of GC adherence.
The MLM loss is defined as follows:
Here, is the gold codon at position ; this is the true codon to be predicted at the masked location. is the total number of masked positions in the input sequence. is the sequence context excluding position ; all codons are in the input except for the masked codon at . is the probability, according to the model’s parameters , of the gold codon occurring at position given the masked input context .
The GC content calculation leverages a differentiable lookup table mapping each token in the vocabulary to its corresponding GC fraction. For any predicted probability distribution over tokens, the expected GC content is computed as the weighted sum of individual token GC values. This differentiable formulation enables gradient-based optimization while maintaining biological constraints. To ensure robust and stable convergence, ColiFormer implements a self-tuning mechanism for the penalty coefficient , which adaptively adjusts the strictness of GC content constraints during training. The deviation from the target GC content is defined as , where is the smoothed, differentiable GC estimate and is the desired GC value. The violation at the step is quantified as . After an initial curriculum warm-up phase (first 3 epochs), the loss incorporates augmented Lagrangian terms, and both the Lagrange multiplier ( ) and penalty coefficient ( ) undergo scheduled updates every 20 steps. Specifically, at each scheduled update, the dual variable is updated via:
Meanwhile, the penalty coefficient is adaptively increased when constraint violations persist, using the following rule:
where is the violation tolerance, is the penalty update factor, is the relative improvement threshold, and set allowed bounds for . Otherwise, . All parameters and violations are logged at each step.
This adaptive update mechanism tightens the GC constraint only when the model’s progress toward the target GC is inadequate, stabilizing when the target is closely matched. As a result, ColiFormer maintains precise control over sequence composition, allowing robust optimization even when codon context and GC control are in conflict.
While the augmented Lagrangian term explicitly optimizes GC content, other critical biological objectives such as RNA stability and the avoidance of negative cis-regulatory elements are optimized implicitly. The Transformer architecture learns to avoid these deleterious motifs by training on high-quality, native E. coli sequences where such patterns are naturally depleted. Additionally, during the inference phase, the constrained beam search decoding algorithm explicitly tracks these parameters, pruning candidate sequences that violate defined biological thresholds, thus ensuring that optimized sequences satisfy multiple objectives beyond just GC content.
2.5. Training Configuration and Implementation
Training was conducted on NVIDIA GPU (NVIDIA Corporation, Santa Clara, CA, USA) infrastructure with 16 GB VRAM (Video Random Access Memory (GPU memory)) using mixed precision computation to maximize efficiency [26]. The software environment consisted of Python 3.10.12 (Python Software Foundation, Wilmington, DE, USA), PyTorch 2.1.0 (Meta Platforms, Inc., Menlo Park, CA, USA), CUDA Toolkit 11.8 (NVIDIA Corporation, Santa Clara, CA, USA), NumPy 1.24.4 and SciPy 1.11.4. The model was fine-tuned for 15 epochs with a batch size of 6 sequences, employing the AdamW optimizer with a learning rate of 5 × 10^−5^. We implemented a cosine annealing schedule with warm restarts, where the learning rate periodically returns to its initial value before decaying again, promoting exploration of the parameter space and preventing premature convergence to suboptimal solutions, following the SGDR framework [27].
The number of fine-tuning epochs (15) was selected based on early convergence of (i) the validation masked language modeling loss and (ii) the GC-content constraint violation metric (V = ∣GC − μ∣). During training, both metrics were monitored at each epoch. We observed that the loss and GC violation typically plateaued, showing less than 0.5% improvement by epoch 13, indicating satisfactory convergence without overfitting. To ensure complete convergence across all runs, we set the total number of epochs to 15.
A curriculum learning strategy was employed where GC constraints were enforced only after a 3-epoch warm-up period. This approach allows the model to first adapt to E. coli-specific codon usage patterns before introducing the additional constraint of GC content optimization. During the warm-up phase, the model focuses solely on minimizing masked language modeling loss, learning the intrinsic patterns of high-expression E. coli genes. After the curriculum period, the augmented Lagrangian terms are activated, guiding ColiFormer toward sequences that satisfy both expression optimization and GC content requirements. Validation loss plateaued at epoch 13, indicating convergence without overfitting.
The training process incorporated extensive monitoring of both standard metrics and constraint satisfaction. The Lagrangian multiplier λ was updated every 20 optimization steps based on the current constraint violation, while the penalty coefficient ρ underwent adaptive adjustment based on the relative improvement in constraint satisfaction. This dual update mechanism ensures robust convergence even for difficult optimization landscapes where codon preferences and GC constraints may conflict.
2.6. Sequence Generation and Inference Strategies
The fine-tuned model supports multiple sequence generation strategies to accommodate different objectives and constraints. For standard optimization tasks, deterministic decoding selects the highest probability codon at each position, producing consistent and reproducible sequences. For applications requiring sequence diversity, stochastic sampling with temperature scaling enables the generation of multiple distinct sequences while maintaining optimization quality [28]. A key feature is the implementation of a constrained beam search algorithm that maintains multiple candidate sequences simultaneously while enforcing strict GC content bounds throughout the generation process. This approach prevents the overuse of high-frequency codons and ensures the biological viability of optimized sequences. Compared to standard beam search, this constrained version maintains similar computational cost while ensuring that candidates stay within biological limits.
The constrained beam search algorithm incorporates position-aware GC tracking, where each partial sequence maintains its cumulative GC count and projected final GC content. Candidate expansions that would violate the specified GC bounds are pruned from the search space, ensuring all generated sequences satisfy the biological constraints. In practice, increasing beam size expands the search space and can improve sequence quality, but it also raises compute time, which is why moderate sizes work well for routine optimization [29]. The algorithm employs a beam size of 5 (gives a strong quality/latency trade-off for routine use) with a length penalty factor of 1.0 (defaults), though beam size 10 and length penalty 1.2 are shown in examples to balance sequence quality with computational efficiency. These values were chosen because they provided stable performance in testing while keeping runtime low [30]. Full decoding configurations could be found in the open-source implementation of ColiFormer. For E. coli optimization, we enforce GC bounds of 50–56%, providing a ±2% tolerance around the optimal value to reflect natural sequence variability.
2.7. Statistical Analysis
For our metric, we report the mean ± standard deviation (SD). Comparisons between the original sequences and each optimization method were assessed with paired two-sided t-tests across matched genes. We also report p values (the probability of observing t at least this extreme if the true mean difference were zero), where t is a number that measures how far the observed mean difference is from zero in standard error units. For NCRF, the sample size is n = 61 due to its sequence length constraints; all other methods use n = 80. A significance threshold of α = 0.05 was used.
2.8. Code Availability
The open-source implementation of ColiFormer is publicly available at: https://github.com/SAKETH11111/ColiFormer (accessed on 12 December 2025). To facilitate broader adoption, we also offer a user-friendly interface for optimizing E. coli sequences, available at: https://huggingface.co/spaces/saketh11/ColiFormer (accessed on 12 December 2025).
3. Results
3.1. Optimizing Native E. coli Genes
To determine ColiFormer’s ability to optimize both native E. coli sequences and established recombinant targets, we benchmarked it on 37,053 native E. coli genes alongside 80 industry-relevant genes, shown in Supplementary File S3, derived from past recombinant protein expression studies. In addition, to compare ColiFormer’s performance to current approaches, we evaluated it against established methods, including IDT, ERC, NCRF, CodonGPT, and CodonTransformer.
After benchmarking on the native E. coli genes, ColiFormer demonstrated a significant improvement in the CAI, which measures how closely a gene’s codon usage matches that of highly expressed genes in a given organism, making it a strong predictor of gene expression levels in vivo [31]. It is important to note that while in silico metrics such as CAI and tAI are widely used indicators of expression potential, they are computational proxies. Higher scores in these metrics predict, but do not guarantee, improved protein yield in vivo. The baseline CAI of 0.689 represents the average across the 37,053 non-high-expression E. coli genes used for evaluation. The average CAI increased from 0.689 in the original sequences to 0.904 after optimization, representing an approximate 31% improvement. This enhancement strongly suggests greater codon usage alignment with highly expressed genes, indicating likely improvements in recombinant protein yield and cellular expression efficiency. The CAI increase is shown in Figure 3.
Regarding the other key parameters affecting expression quality [31,32], ColiFormer maintained the GC content within the optimal range, increasing it modestly from 51.24% to 55.96%. The increase to 55.96% remains within the biologically preferred 50–56% range for E. coli, avoiding extremes that affect expression efficiency. Importantly, the model reduced the incidence of negative cis-regulatory elements from 0.162 to 0.068. The tAI also increased from 0.303 to 0.373, further indicating stable protein expression. All individual values for each sequence before and after optimization across these key parameters are provided in Supplementary File S4 for further analysis.
3.2. Optimizing Established Recombinant Targets
To contextualize ColiFormer’s performance in optimizing established recombinant targets, IDT, CodonTransformer, NCRF, and ERC were used to optimize the benchmark set of the 80 industry-relevant genes. Those 80 genes have an average CAI of 0.58 ± 0.081. ColiFormer’s optimization yielded a CAI of 0.901 ± 0.0278 (p < 1 × 10^−16^), satisfying a 55.3% improvement compared to the original. ERC showed a modest improvement to 0.645 ± 0.083 (+11% over original); IDT optimization achieved an average CAI of 0.754 ± 0.033 (+30%); CodonTransformer reached 0.871 ± 0.029 (p < 1 × 10^−16^) (+50%); NCRF optimization yielded a CAI of 0.9634 (p < 1 × 10^−16^) (+66%). Although NCRF produced the highest average CAI of 0.9634 ± 0.0095, its success rate was only 76.3%. The 76.3% success rate reflects the model’s inability to process sequences longer than ~800 amino acids due to input token limitations, with failures concentrated in 19 large proteins averaging ~830 amino acids. The mean CAI for all approaches, along with individual gene-level values, is shown in Figure 4.
Across the models, the optimal GC content range is generally accepted between 45 and 65%, with deviations outside this range negatively impacting molecular processes and gene performance [33,34]. Specifically, the E. coli genome shows an optimal GC content between 52% and 56% [15,35,36]. ColiFormer maintains an average GC content of 56.38% ± 7.56% (p = 9.7 × 10^−10^), very close to the optimal value, well within this accepted range, along with the other models. The mean GC Content for all approaches, along with individual gene-level values, is shown in Figure 5.
The tRNA Adaptation Index measures how well a gene’s codon usage matches the abundance and efficiency of the cellular tRNA pool, with higher tAI values indicating better translational efficiency [31]; ColiFormer achieved a high average tAI of 0.3628 ± 0.032 (p < 1 × 10^−16^) (27% over the original), notably higher than the achieved tAI of NCRF and CodonTransformer. The mean tAI for all approaches, along with individual gene-level values, is shown in Figure 6.
RNA stability was assessed by calculating the change in folding energy (Δ5′ MFE) over the initial 50 nucleotides. This metric impacts ribosome binding and translation initiation efficiency, where lower values correspond to enhanced overall translational efficiency [37]. ColiFormer achieved a Δ5′ MFE of 0.22 kcal/mol, representing the closest value to the native folding energy of the original sequences, indicating the highest preservation of native RNA folding stability (as measured by Minimum Free Energy, MFE). This performance exceeded that of CodonTransformer (0.31 kcal/mol) and ERC (0.70 kcal/mol). In contrast, NCRF exhibited the lowest mean Δ5′ MFE (−2.49 kcal/mol), suggesting the strongest structural stabilization, while IDT also lowered the 5′ MFE (−0.40 kcal/mol) relative to the original sequence baseline (0.00 kcal/mol). The mean change in folding energy (Δ5′ MFE) over the first 50 nucleotides for all approaches is shown in Figure 7.
Negative cis-regulatory elements act as inhibitory sequences that can severely repress gene expression, and their decrease is beneficial for effective transcription and translation [38]. They include known inhibitory motifs such as RBS occlusion sequences and premature poly-U tracts. ColiFormer demonstrated the lowest average negative cis elements of 0.0625, substantially lower than the other techniques, with values ranging from 3.06 to 5.26. This highlights ColiFormer’s strength in balancing multiple critical factors. The mean negative cis-regulatory elements for all approaches are shown in Figure 8.
The run time was calculated for ColiFormer and the ERC, CodonTransformer, and NCRF techniques on the benchmark set of 80 sequences. However, the average runtime for IDT was not calculated because it is a web-based tool rather than a standalone software or code that can be timed directly. The scores are displayed in Table 1.
In summary, our comprehensive experiments demonstrated that ColiFormer consistently achieves superior performance compared to existing codon optimization methods. This advantage is reflected in improved benchmark metrics, including CAI, tAI, and GC content control, motivating the discussion of ColiFormer’s innovations and impact in the following section. All individual readings and optimization results for each technique on the benchmark set of 80 sequences are provided separately as follows: ColiFormer in Supplementary File S5, CodonTransformer in Supplementary File S6, NCRF in Supplementary File S7, ERC in Supplementary File S8, and IDT in Supplementary File S9. Additional references supporting the benchmark datasets and recombinant gene sources are provided in Supplementary File S3.
4. Discussion
This study introduced ColiFormer, a transformer-based codon optimization model specifically fine-tuned for E. coli, that integrates multi-objective optimization to enhance recombinant gene expression. Across all benchmarks, ColiFormer outperformed existing tools, achieving higher CAI, improved tAI, and substantially fewer negative cis-regulatory elements, while maintaining biologically appropriate GC ranges. These results highlight ColiFormer’s ability to improve predicted expression without imposing undue cellular burden, demonstrating that effective codon optimization requires balancing expression efficiency with biological compatibility.
A key innovation of ColiFormer is its augmented Lagrangian approach for GC-content control. This method directly incorporates GC constraints into the model’s objective function rather than relying on post hoc filtering. This integrated strategy enables dynamic adjustment of codon probabilities based on projected GC deviation, providing tighter and more biologically grounded control, a feature that is largely absent in existing models. In addition to its training-time advantages, the constrained beam search used during sequence generation improves reliability compared with standard beam search by preventing GC drift under strong optimization pressure.
A central biological insight of this work is that codon optimization must be balanced with cellular compatibility to preserve natural translation dynamics. Over-optimization toward highly abundant codons can disrupt natural ribosomal pausing and induce proteotoxic stress. These effects impair co-translational folding and reduce soluble protein yield. Prior studies show that harmonization of rare codons preserves essential pauses required for domain-wise folding and mitigates cellular stress responses. By explicitly modeling both optimization and harmonization, ColiFormer avoids the classical trade-off where high CAI sequences exhibit misfolding or metabolic burden. This supports emerging evidence that biologically informed codon design is essential for maintaining translational homeostasis and cell viability.
An additional strength of ColiFormer lies in its augmented Lagrangian GC-control strategy, which provides practical and reproducible advantages over conventional codon optimization tools. Most existing approaches either ignore GC content entirely or apply simple post hoc filters that do not interact with the optimization process. In contrast, ColiFormer integrates GC constraints directly into training and sequence generation, allowing codon choices to be dynamically adjusted based on projected GC deviation. This results in tighter, more biologically relevant control compared to rule-based or heuristic methods, which often drift toward extreme GC content values when aggressively optimizing CAI.
As outlined earlier in the Introduction, we initially intended to benchmark ColiFormer against the RL (Reinforcement Learning)-optimized variant of CodonGPT, which reportedly achieves CAI values between 79 and 84% [19]. However, the RL-trained CodonGPT model has not been made publicly available, precluding direct benchmarking. Despite this, ColiFormer achieves CAI scores higher than those reported for CodonGPT across our test set.
From a computational perspective, ColiFormer demonstrates a low computational complexity, optimizing sequences in approximately 0.5 s each. This remarkable processing speed is a major advantage over many existing methods that require considerably longer runtimes, enabling highly scalable workflows suitable for industrial and research applications requiring high-throughput gene design.
Our evaluation framework incorporates a comprehensive set of in silico metrics. However, these metrics do not fully capture all factors influencing gene expression. Recent research highlights the potential to further enhance translation dynamics modeling by integrating metrics like elongation rate and tRNA adaptation, which could provide more nuanced insights into translational efficiency and protein folding [39]. ColiFormer’s architecture and output are well-suited for such extensions in future work. Because the augmented Lagrangian objective treats each constraint as an independent, differentiable term, it can readily incorporate additional sequence features. Likewise, the constrained decoding procedure can be extended to enforce any property, such as maintaining desired mRNA secondary structure or controlling the usage of rare codons, enabling broader constraint-aware sequence design in future versions of ColiFormer.
Looking ahead, ColiFormer can be extended in several impactful ways. Incorporating models of elongation rates and codon-pair context would capture translation dynamics more accurately, while integrating structural protein modeling could link codon usage to folding stability and function. Likewise, experimental validation of optimized sequences will be essential to confirm computational predictions and guide further refinement in future studies.
Regarding codon usage metrics, the CAI used in this study is functionally equivalent to the Codon Similarity Index (CSI), which applies CAI genome-wide to avoid bias from selecting specific reference genes. This ensures optimization reflects overall E. coli codon usage rather than overemphasizing highly expressed genes, which can sometimes disrupt folding fidelity. However, CAI/CSI have limitations: they assume that codon patterns in highly expressed genes always predict optimal translation, which overlooks codon-pair effects and mRNA structural constraints.
ColiFormer’s multi-objective design helps mitigate these issues and remains extensible for future integration of more mechanistic translation metrics. Although ColiFormer is currently trained exclusively on E. coli genes, its methodology is broadly adaptable for other host organisms, including yeast and mammalian cells. Transfer learning approaches can leverage pre-trained weights and fine-tune the model using organism-specific data, accelerating adaptation.
Despite these promising results, several limitations must be acknowledged. First and foremost, this study presents a computational proof of concept. While in silico metrics like CAI and tAI are strong predictors of expression potential, they serve as proxies and cannot fully replace biological validation. Experimental verification of protein yield and solubility in E. coli is the necessary next step to confirm these computational predictions. Furthermore, while ColiFormer preserves RNA folding stability (as evidenced by our MFE analysis) and avoids negative cis elements, it does not explicitly model translational pausing at specific rare codons, which some studies suggest is vital for co-translational folding of complex proteins.
In this study, harmonization is therefore achieved implicitly through the preservation of native-like sequence context and folding-related features rather than through explicit modeling of ribosomal pauses. Future work will focus on in vivo validation of the benchmarked sequences and the integration of explicit elongation rate models to further refine the balance between translation speed and folding accuracy.
Finally, to ensure transparency and reproducibility, we emphasize the importance of making the ColiFormer model and its codebase publicly available. Open access to these resources allows independent verification of results and supports broader adoption of the methods introduced in this study. By enabling the community to retrain and extend ColiFormer, we help establish stronger standards for codon optimization research and ensure long-term usability of the tool.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Chang A.C.Y. Cohen S.N. Genome Construction Between Bacterial Species In Vitro: Replication and Expression of Staphylococcus Plasmid Genes in Escherichia coli Proc. Natl. Acad. Sci. USA 1974711030103410.1073/pnas.71.4.10304598290 PMC 388155 · doi ↗ · pubmed ↗
- 2Morrow J.F. Cohen S.N. Chang A.C.Y. Boyer H.W. Goodman H.M. Helling R.B. Replication and Transcription of Eukaryotic DNA in Escherichia coli Proc. Natl. Acad. Sci. USA 1974711743174710.1073/pnas.71.5.17434600264 PMC 388315 · doi ↗ · pubmed ↗
- 3Watts A. Sankaranarayanan S. Watts A. Raipuria R.K. Optimizing Protein Expression in Heterologous System: Strategies and Tools Meta Gene 20212910089910.1016/j.mgene.2021.100899 · doi ↗
- 4Swiech K. Picanço-Castro V. Covas D.T. Human Cells: New Platform for Recombinant Therapeutic Protein Production Protein Expr. Purif.20128414715310.1016/j.pep.2012.04.02322580292 · doi ↗ · pubmed ↗
- 5Keasling J.D. Bang S.-W. Recombinant DNA Techniques for Bioremediation and Environmentally-Friendly Synthesis Curr. Opin. Biotechnol.1998913514010.1016/S 0958-1669(98)80105-59664056 · doi ↗ · pubmed ↗
- 6Liu M. Xiao R. Li X. Zhao Y. Huang J. A Comprehensive Review of Recombinant Technology in the Food Industry: Exploring Expression Systems, Application, and Future Challenges Compr. Rev. Food Sci. Food Saf.202524 e 7007810.1111/1541-4337.7007839970011 · doi ↗ · pubmed ↗
- 7İncirİ. KaplanÖ. Escherichia coli as a Versatile Cell Factory: Advances and Challenges in Recombinant Protein Production Protein Expr. Purif.202421910646310.1016/j.pep.2024.10646338479588 · doi ↗ · pubmed ↗
- 8Green C.A. Kamble N.S. Court E.K. Bryant O.J. Hicks M.G. Lennon C. Fraser G.M. Wright P.C. Stafford G.P. Engineering the Flagellar Type III Secretion System: Improving Capacity for Secretion of Recombinant Protein Microb. Cell Factories 2019181010.1186/s 12934-019-1058-4PMC 633778430657054 · doi ↗ · pubmed ↗
