# Pichia-CLM: A language model–based codon optimization pipeline for Komagataella phaffii

**Authors:** Harini Narayanan, J. Christopher Love

PMC · DOI: 10.1073/pnas.2522052123 · Proceedings of the National Academy of Sciences of the United States of America · 2026-02-17

## TL;DR

Pichia-CLM is a deep learning tool that improves protein production in Komagataella phaffii by learning codon usage from the host genome; in experimental comparisons it outperformed four commercial codon optimization tools.

## Contribution

Pichia-CLM introduces a novel language model-based codon optimization pipeline that learns from the host genome and outperforms commercial tools experimentally.

## Key findings

- Pichia-CLM enhanced heterologous protein production up to threefold compared to native sequences.
- The model consistently outperformed four commercial codon optimization tools across diverse protein classes.
- Codon usage bias metrics correlate poorly with observed protein yields, which limits their value as a proxy for expression.

## Abstract

This paper presents Pichia–Codon language model (Pichia-CLM), a deep learning–based language model for codon optimization to enhance recombinant protein production in the industrially relevant host, Komagataella phaffii. Unlike conventional approaches that rely on codon usage bias (CUB) metrics, which often provide a single global score and ignore sequence context, Pichia-CLM leverages the host genome to learn the amino acid-to-codon mapping in an unbiased manner. Prior deep learning models have attempted codon optimization but typically evaluated performance using CUB metrics with limited experimental validation. In contrast, we experimentally validate Pichia-CLM across six diverse protein classes of varying complexity and consistently observe superior expression titers compared to four commercial codon optimization tools. Our findings also reveal the limitations of using CUB metrics, showing a poor correlation between these scores and observed protein yields.

The preference in synonymous codon usage, the so-called codon usage bias (CUB), is governed by several factors such as the host organism, the context and function of the gene, and the position of the codon within the gene itself. We demonstrated that the mapping from amino acids to codons can be learned from the host's genome using language models and subsequently applied for codon optimization of heterologous proteins expressed by the host. This pipeline, called Pichia–Codon language model (Pichia-CLM), was applied to the industrial host organism, Komagataella phaffii. With this approach, production of heterologous proteins was enhanced up to threefold compared to their native sequences. Furthermore, Pichia-CLM consistently yielded constructs with enhanced productivity for proteins of varied complexity, compared to commercially available tools. Finally, we showed that Pichia-CLM generates sequences whose codon usage resembles that of the host's own proteins, and that it learned features such as avoiding negative cis-regulatory and repeat elements from patterns in the genome data. These results show the potential of language models to learn patterns in an unbiased manner and design robust sequences for improved protein production.
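To make the notion of codon usage bias concrete, the following is a minimal sketch of the kind of context-free, frequency-based baseline that the abstract contrasts with Pichia-CLM. It is not the authors' method: it only tabulates how often each synonymous codon appears in a set of host coding sequences and then back-translates a protein by always picking the most frequent codon. The function names, the toy sequences, and the use of the standard genetic code (NCBI translation table 1) are assumptions made for illustration; real input would be the host's annotated coding sequences.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), built from the canonical
# TCAG ordering; '*' marks stop codons. Assumed here for illustration.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_frequencies(cds_list):
    """Return {amino_acid: {codon: relative frequency}} from in-frame CDSs."""
    counts = defaultdict(Counter)
    for cds in cds_list:
        seq = cds.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            aa = CODON_TO_AA.get(codon)
            if aa is None:  # skip codons containing ambiguous bases
                continue
            counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def most_frequent_codon_backtranslate(protein, freqs):
    """Context-free baseline: choose the most frequent codon per residue."""
    return "".join(max(freqs[aa], key=freqs[aa].get) for aa in protein)


# Hypothetical toy sequences, not real K. phaffii genes.
toy_cds = ["ATGGCTGCTGCCAAGTAA", "ATGGCTAAGAAAGCTTAA"]
freqs = codon_frequencies(toy_cds)
designed = most_frequent_codon_backtranslate("MAK", freqs)
```

Because this baseline assigns each residue the same codon regardless of its neighbours or position, it illustrates the loss of sequence context that the abstract attributes to global CUB-based approaches.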

## Linked entities

- **Species:** Komagataella phaffii (taxon 460519)

## Full-text entities

- **Genes:** GH1 (growth hormone 1) [NCBI Gene 2688] {aka GH, GH-N, GHB5, GHN, IGHD1A, IGHD1B}, RPN1 (ribophorin I) [NCBI Gene 6184] {aka OST1, RBPH1}, AOX1 (aldehyde oxidase 1) [NCBI Gene 316] {aka AO, AOH1}, ALB (albumin) [NCBI Gene 213] {aka FDAHT, HSA, PRO0883, PRO0903, PRO1341}, TPO (thyroid peroxidase) [NCBI Gene 7173] {aka MSA, TDH2A, TPX}, TSLIG1 (tRNA splicing ligase complex subunit 1) [NCBI Gene 339487] {aka ARCH, ARCH2, ZBTB8OS}
- **Chemicals:** agarose (MESH:D012685), GC (MESH:C057580), Alcohols (MESH:D000438), glycerol (MESH:D005990), Azenta (-), Amino Acids (MESH:D000596), Amide (MESH:D000577), Glycine (MESH:D005998), SDS (MESH:D012967), acid (MESH:D000143), methanol (MESH:D000432), Zeocin (MESH:C105427), carbon (MESH:D002244), trastuzumab (MESH:D000068878), PNAS (MESH:D020135), potassium phosphate (MESH:C013216)
- **Species:** Bos taurus (bovine, species) [taxon 9913], Komagataella phaffii GS115 (strain) [taxon 644223], Homo sapiens (human, species) [taxon 9606], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Komagataella pastoris (species) [taxon 4922], Komagataella phaffii (species) [taxon 460519], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Mus musculus (house mouse, species) [taxon 10090]
- **Mutations:** D196E
- **Cell lines:** K. phaffii — Clarias batrachus (Walking catfish), Spontaneously immortalized cell line (CVCL_S935), CHO — Cricetulus griseus (Chinese hamster), Spontaneously immortalized cell line (CVCL_0213)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12933070/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12933070/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12933070/full.md

---
Source: https://tomesphere.com/paper/PMC12933070