# Efficient discovery of frequently co-occurring mutations in a sequence database with matrix factorization

**Authors:** Michael Robert Kolar, Valerie Kobzarenko, Debasis Mitra, Rob J De Boer, Jordan Douglas

PMC · DOI: 10.1371/journal.pcbi.1012391 · 2025-04-24

## TL;DR

This paper introduces a matrix factorization method that efficiently finds co-occurring mutations in viral sequences, which could help explain virus evolution and inform vaccine design.

## Contribution

The novel contribution is a matrix factorization-based approach for efficiently identifying co-occurring mutations in large sequence databases.

## Key findings

- The method outperforms a reasonably exhaustive brute-force baseline in identifying co-mutational positions (CMPs) across approximately 700,000 SARS-CoV-2 Spike receptor-binding domain sequences.
- Identified co-mutations align with biologically significant mutations in Delta and Omicron variants.
- Tracking co-mutational patterns reveals insights into viral evolution and adaptability.

## Abstract

We have developed a robust method for efficiently tracking multiple co-occurring mutations in a sequence database. Evolution often hinges on the interaction of several mutations to produce significant phenotypic changes that lead to the proliferation of a variant. However, identifying numerous simultaneous mutations across a vast database of sequences poses a significant computational challenge. Our approach leverages a matrix factorization technique to automatically and efficiently pinpoint subsets of positions where co-mutations occur in a substantial number of sequences within the database. We validated our method using SARS-CoV-2 receptor-binding domains, comprising approximately seven hundred thousand sequences of the Spike protein, demonstrating superior performance compared to a reasonably exhaustive brute-force method. Furthermore, we explore the biological significance of the identified co-mutational positions (CMPs) and their potential impact on the virus’s evolution and functionality, identifying key mutations in Delta and Omicron variants. This analysis underscores the significant role of identified CMPs in understanding the virus’s evolutionary trajectory. By tracking the “birth” and “death” of CMPs, we can elucidate the persistence and impact of specific groups of mutations across different viral strains, providing valuable insights into the virus’s adaptability and thus possibly aiding vaccine design strategies.
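To make the brute-force baseline concrete, the following is a minimal Python sketch of exhaustive co-mutation counting. It is an illustration under stated assumptions, not the authors' implementation: sequences are assumed to be pre-aligned and of equal length, a mutation is taken to be any residue that differs from a reference, and the names `mutated_positions`, `brute_force_cmps`, `size`, and `min_support` are hypothetical. The toy sequences are made up.

```python
from collections import Counter
from itertools import combinations


def mutated_positions(reference, sequence):
    """Return the 0-based positions where an aligned sequence differs from the reference."""
    return tuple(i for i, (r, s) in enumerate(zip(reference, sequence)) if r != s)


def brute_force_cmps(reference, sequences, size, min_support):
    """Count every co-mutated position subset of a fixed size; keep the frequent ones.

    Assumption: 'frequent' means the subset is mutated together in at least
    min_support sequences. The threshold is a hypothetical parameter.
    """
    counts = Counter()
    for seq in sequences:
        positions = mutated_positions(reference, seq)
        # Positions are in ascending order, so each subset is a sorted tuple.
        counts.update(combinations(positions, size))
    return {subset: n for subset, n in counts.items() if n >= min_support}


# Toy data (made up): positions 0 and 5 are mutated together in three sequences.
reference = "NLKTGE"
sequences = ["NLKTGE", "KLKTGA", "KLRTGA", "KLKTGA", "NLRTGE"]
frequent_pairs = brute_force_cmps(reference, sequences, size=2, min_support=3)
print(frequent_pairs)  # {(0, 5): 3}
```

A sequence with m mutated positions contributes C(m, k) subsets of size k, so the cost of this enumeration grows quickly with both the number of mutations per sequence and the subset size, which is the computational challenge the abstract refers to.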

Mutations in biological sequences occur due to various factors, with viral surface proteins evolving under strong selective pressures to enhance their infectivity, immune evasion, or replication efficiency. The vast number of possible mutations, particularly in longer proteins, necessitates the discovery of efficient computational approaches to identify co-occurring mutations, as these may have underlying functional significance and serve as potential vaccine or therapeutic targets. This study introduces a novel methodology that applies a modified Levenshtein distance to construct a mutation matrix from SARS-CoV-2 receptor-binding domain (RBD) sequences of the Spike protein, which the virus uses to enter host cells. Non-negative matrix factorization (NMF) is then applied to decompose the matrix, facilitating the detection of co-occurring positional mutations. The approach was evaluated against an optimized brute-force algorithm, ensuring that the method maintained computational efficiency and accuracy. The identified co-mutations were cross-validated with documented biologically significant mutations, leading to the discovery of relevant mutation patterns within the SARS-CoV-2 RBD. These findings highlight the potential of unsupervised learning techniques for uncovering biologically meaningful mutation patterns, providing a foundation for future studies in viral evolution.
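The pipeline described above (mutation matrix, then NMF, then co-mutated positions) can be sketched in a few lines of dependency-free Python. Several parts are assumptions made for illustration: this summary does not specify the modified Levenshtein distance, so a plain per-position mismatch against a reference stands in for it; NMF is solved here with Lee-Seung multiplicative updates, a common choice that may differ from the paper's solver; and the rule for reading positions out of each component (`component_positions` with its `fraction` threshold) is hypothetical. The toy sequences are made up.

```python
import random


def mutation_matrix(reference, sequences):
    """Binary matrix: rows are sequences, columns are positions, 1.0 marks a mismatch.

    Assumption: a stand-in for the paper's modified Levenshtein construction.
    """
    return [[1.0 if r != s else 0.0 for r, s in zip(reference, seq)] for seq in sequences]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(v, rank, iterations=500, seed=0, eps=1e-9):
    """Factor v (n x m) into non-negative w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


def component_positions(h, fraction=0.5):
    """Hypothetical read-out: positions whose weight is within `fraction` of the row's peak."""
    groups = []
    for row in h:
        peak = max(row)
        groups.append(tuple(j for j, x in enumerate(row) if peak > 0 and x >= fraction * peak))
    return groups


# Toy data (made up): one group mutates positions 0 and 5, another positions 2 and 3.
reference = "NLKTGE"
sequences = ["KLKTGA"] * 3 + ["NLRSGE"] * 3 + ["NLKTGE"]
v = mutation_matrix(reference, sequences)
w, h = nmf(v, rank=2)
groups = component_positions(h)
print(sorted(groups))
```

Each row of `h` acts as a weighting over positions, so positions that carry large weight in the same row are candidates for a co-mutational group, while the matching column of `w` indicates which sequences carry that pattern. The rank and the read-out threshold are tuning choices in this sketch and would need to be selected with care on real data.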

## Linked entities

- **Diseases:** SARS-CoV-2 (MONDO:0100096)

## Full-text entities

- **Genes:** S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12273922/full.md

---
Source: https://tomesphere.com/paper/PMC12273922