# Advancing codon language modeling with synonymous codon constrained masking

**Authors:** James Heuschkel, Laura Kingsley, Noah Pefaur, Andrew Nixon, Steven Cramer

PMC · DOI: 10.1093/nar/gkag166 · Nucleic Acids Research · 2026-02-25

## TL;DR

SynCodonLM is a codon language model that restricts each masked-codon prediction to the synonymous codons of the known amino acid, so that it learns nucleotide-level patterns rather than protein-level ones.

## Contribution

SynCodonLM constrains masked codon prediction to synonymous codons by removing non-synonymous codons from the prediction space before the softmax, which disentangles codon usage from protein semantics.

## Key findings

- SynCodonLM clusters codons by nucleotide properties rather than amino acid identity.
- The model outperforms existing models on six of seven benchmarks sensitive to DNA-level features, including messenger RNA and protein expression.
- The approach enables better representation learning for DNA-level biology and synthetic biology applications.

## Abstract

Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on six of seven benchmarks sensitive to DNA-level features, including messenger RNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses.
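
The constraint described in the abstract (masking non-synonymous codons from the prediction space prior to softmax) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 64-codon vocabulary, the standard genetic code, and a plain list of per-codon logits, and the function name `synonymous_softmax` is hypothetical. The model architecture and tokenizer are not described in this summary, so they are left out.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the paper's actual vocabulary and ordering may differ).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def synonymous_softmax(logits, amino_acid):
    """Softmax over the 64 codons, restricted to codons encoding `amino_acid`.

    `logits` is a list of 64 finite floats aligned with CODONS.
    Non-synonymous codons are set to -inf before the softmax, so they
    receive exactly zero probability.
    """
    if len(logits) != len(CODONS):
        raise ValueError("expected one logit per codon (64)")
    allowed = [CODON_TO_AA[c] == amino_acid for c in CODONS]
    if not any(allowed):
        raise ValueError(f"no codon encodes {amino_acid!r}")

    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    top = max(masked)  # finite, because at least one codon is allowed
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return {codon: e / total for codon, e in zip(CODONS, exps)}


# Usage: with uniform logits, the six leucine codons share the probability mass.
probs = synonymous_softmax([0.0] * 64, "L")
leucine_codons = [c for c in CODONS if CODON_TO_AA[c] == "L"]
```

Because amino acids with a single codon (methionine, tryptophan) leave only one option after masking, the prediction at those positions is fully determined by the protein sequence, and any training signal under such a scheme would come from positions where several synonymous codons remain.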

## Full-text entities

- **Genes:** TDH3 (glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) TDH3) [NCBI Gene 853106] {aka GLD1, HSP35, HSP36, SSS2}, CDS1 (CDP-diacylglycerol synthase 1) [NCBI Gene 1040] {aka CDS 1}
- **Diseases:** Toxicity (MESH:D064420), ID (MESH:C537985)
- **Chemicals:** dinucleotide (MESH:D015226), tryptophan (MESH:D014364), methionine (MESH:D008715), alanine (MESH:D000409)
- **Species:** Cricetulus griseus (Chinese hamster, species) [taxon 10029], Homo sapiens (human, species) [taxon 9606], Escherichia coli (E. coli, species) [taxon 562], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932]
- **Mutations:** L40S
- **Cell lines:** CHO — Cricetulus griseus (Chinese hamster), Spontaneously immortalized cell line (CVCL_0213)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12956333/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12956333/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC12956333/full.md

---
Source: https://tomesphere.com/paper/PMC12956333