# MCDAG: indexing maximal common subsequences for k strings

**Authors:** Giovanni Buzzega, Alessio Conte, Roberto Grossi, Giulia Punzi

PMC · DOI: 10.1186/s13015-025-00271-z · Algorithms for Molecular Biology : AMB · 2025-04-19

## TL;DR

This paper introduces MCDAG, a publicly available tool that indexes the maximal common subsequences (MCSs) of real genomic sequences.

## Contribution

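MCDAG is the first publicly available tool that can index maximal common subsequences of real genomic data. The authors also show that its definition can be generalized to more than two strings.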

## Key findings

- MCDAG indexes pairs of sequences exceeding 10,000 base pairs within minutes.
- For two sequences, the index uses only 4-7% more nodes than the minimum required.
- For three or more sequences, experiments show that the minimum index may need significantly more nodes.

## Abstract

Analyzing and comparing sequences of symbols is among the most fundamental problems in computer science, possibly even more so in bioinformatics. Maximal Common Subsequences (MCSs), i.e., inclusion-maximal sequences of non-contiguous symbols common to two or more strings, have only recently received attention in this area, despite being a basic notion and a natural generalization of more common tools like Longest Common Substrings/Subsequences. In this paper we simplify and engineer recent advancements in MCSs into a practical tool called MCDAG, the first publicly available tool that can index MCSs of real genomic data, and show that its definition can be generalized to multiple strings. We demonstrate that our tool can index pairs of sequences exceeding 10,000 base pairs within minutes, utilizing only 4-7% more than the minimum required nodes. For three or more sequences, we observe experimentally that the minimum index may exhibit a significant increase in the number of nodes.
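
To make the definition of an MCS concrete, the following is a minimal brute-force sketch in Python that checks whether a string is a maximal common subsequence of k strings. It is not the MCDAG algorithm and does not build any index; the function names (`is_subsequence`, `is_common_subsequence`, `is_maximal_common_subsequence`) are hypothetical and chosen only for illustration. It relies on the fact that a common subsequence is inclusion-maximal exactly when no single-character insertion yields another common subsequence.

```python
def is_subsequence(w, s):
    """Return True if w can be obtained from s by deleting symbols."""
    it = iter(s)
    # `ch in it` advances the iterator, so matches are found left to right.
    return all(ch in it for ch in w)


def is_common_subsequence(w, strings):
    """Return True if w is a subsequence of every string in `strings`."""
    return all(is_subsequence(w, s) for s in strings)


def is_maximal_common_subsequence(w, strings):
    """Brute-force maximality check (illustrative only, assumes k >= 1).

    w is maximal if it is a common subsequence and inserting any one
    symbol at any position never gives another common subsequence.
    """
    if not is_common_subsequence(w, strings):
        return False
    # Only symbols present in every string can extend w.
    alphabet = set.intersection(*(set(s) for s in strings))
    for pos in range(len(w) + 1):
        for ch in alphabet:
            if is_common_subsequence(w[:pos] + ch + w[pos:], strings):
                return False
    return True


if __name__ == "__main__":
    pair = ["ACG", "GAC"]
    for candidate in ["AC", "G", "A"]:
        print(candidate, is_maximal_common_subsequence(candidate, pair))
```

For the pair `ACG` and `GAC`, both `AC` and `G` are maximal even though only `AC` is a longest common subsequence, while `A` is not maximal because it extends to `AC`. This is why MCSs generalize the LCS: every LCS is an MCS, but an MCS may be shorter.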

## Full-text entities

- **Diseases:** LCS (MESH:D000083102)
- **Chemicals:** AF004885 (-)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12008955/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12008955/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12008955/full.md

---
Source: https://tomesphere.com/paper/PMC12008955