Calculating complexity of large randomized libraries
Yong Kong

TL;DR
This paper develops formulas and software to calculate the mean and variance of the number of unique sequences in large randomized libraries with arbitrary nucleotide ratios, aiding library design and evaluation.
Contribution
It introduces new formulas and a computer program to compute statistics of large randomized libraries with arbitrary nucleotide ratios. Previous methods gave only the mean and were limited to equal molar ratios of the four nucleotides at a small number of mutation sites.
Findings
Nucleotide ratios significantly influence library statistics.
Skewed ratios require larger libraries for the same diversity.
The software can handle libraries with mutations at more than 20 amino acid positions.
Abstract
Randomized libraries are increasingly popular in protein engineering and other biomedical research fields. Statistics of the libraries are useful to guide and evaluate randomized library construction. Previous works give only the mean of the number of unique sequences in the library, and they can handle only equal molar ratios of the four nucleotides at a small number of mutation sites. We derive formulas to calculate the mean and variance of the number of unique sequences in libraries generated by cassette mutagenesis with mixtures of arbitrary nucleotide ratios. A computer program was developed that uses an arbitrary-precision numerical software package to calculate the statistics of large libraries. The statistics of libraries with mutations in more than amino acids can be calculated easily. Results show that the nucleotide ratios have significant effects on these statistics. The…
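To make the quantities in the abstract concrete, here is a minimal brute-force sketch of the mean and variance of the number of unique sequences. It is not the paper's formulas or program, which are not reproduced on this page. It assumes each library member is drawn independently, with each mutated position sampled from its own nucleotide mixture, and uses the standard occupancy identities: with p_i the probability of sequence i and N the library size, the mean is the sum over i of 1 - (1 - p_i)^N, and the variance adds the pairwise covariance terms (1 - p_i - p_j)^N - (1 - p_i)^N (1 - p_j)^N. The function names and the input layout are hypothetical, and because every sequence pair is enumerated the sketch is only practical for a handful of positions.

```python
from itertools import product
from math import prod


def sequence_probs(site_ratios):
    """Probability of every possible sequence.

    site_ratios: one dict per mutated position, mapping nucleotide -> probability
    (hypothetical input layout, chosen for this illustration).
    """
    return [
        prod(p for _, p in combo)
        for combo in product(*(ratios.items() for ratios in site_ratios))
    ]


def unique_count_stats(probs, n):
    """Mean and variance of the number of distinct sequences among n independent draws."""
    # miss[i] = probability that sequence i never appears in the library
    miss = [(1 - p) ** n for p in probs]
    mean = sum(1 - q for q in miss)
    # Variance of a sum of indicators: individual variances plus covariances.
    var = sum(q * (1 - q) for q in miss)
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            # Probability that neither sequence i nor sequence j appears.
            both_missing = max(0.0, 1 - probs[i] - probs[j]) ** n
            var += 2 * (both_missing - miss[i] * miss[j])
    return mean, var


if __name__ == "__main__":
    equal = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 2
    skewed = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 2
    for name, sites in (("equal", equal), ("skewed", skewed)):
        mean, var = unique_count_stats(sequence_probs(sites), n=50)
        print(f"{name}: mean={mean:.3f} variance={var:.3f}")
```

Running the script compares an equal mixture with a skewed one at the same library size. The skewed mixture yields fewer unique sequences on average, in line with the finding that skewed ratios require larger libraries for the same diversity.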
