AdDeam: a fast and scalable tool for estimating and clustering reference-level damage profiles

Louis Kraft; Thorfinn Sand Korneliussen; Peter Wad Sackett; Gabriel Renaud

PMC · DOI:10.1093/bioinformatics/btaf407·July 17, 2025

AdDeam: a fast and scalable tool for estimating and clustering reference-level damage profiles

Louis Kraft, Thorfinn Sand Korneliussen, Peter Wad Sackett, Gabriel Renaud

PDF

Open Access

TL;DR

AdDeam is a new tool that quickly identifies and groups DNA damage patterns in ancient DNA samples, helping researchers detect contamination and authenticate ancient DNA.

Contribution

AdDeam introduces a fast and scalable method for clustering DNA damage profiles across multiple samples or contigs.

Findings

01

AdDeam distinguishes different damage levels, including uracil-DNA glycosylase-treated samples and time-period-specific damages.

02

The tool can differentiate between contigs containing modern or ancient DNA fragments.

03

AdDeam provides a framework for aDNA authentication and large-scale analyses.

Abstract

DNA damage patterns, such as increased frequencies of C→T and G→A substitutions at fragment ends, are widely used in ancient DNA studies to assess authenticity and detect contamination. In metagenomic studies, fragments can be mapped against multiple references or de novo assembled contigs to identify those likely to be ancient. Generating and comparing damage profiles, however, can be both tedious and time-consuming. Although tools exist for estimating damage in single reference genomes and metagenomic datasets, none efficiently cluster damage patterns. To address this methodological gap, we developed AdDeam, a tool that combines rapid damage estimation with clustering for streamlined analyses and easy identification of potential contaminants or outliers. Our tool takes aligned ancient DNA (aDNA) fragments from various samples or contigs as input, computes damage patterns, clusters…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

UNG

Proteins1

Species10

Fretibacterium fastidiosum(species)Treponema denticola(species)Pan troglodytes(chimpanzee · species)Bacteria Latreille et al. 1825(Bacteria stick insect · genus)Alouatta seniculus(howler monkey · species)Alouatta(howler monkeys · genus)Tannerella forsythia(species)Porphyromonas gingivalis(species)Gorilla(genus)Homo sapiens(human · species)

Chemicals3

thymine uracil cytosines

Diseases3

CLASSIC MODE dental calculus periodontal disease

Figures4

Click any figure to enlarge with its caption.

Overview of the two modes of the AdDeam workflow: classic mode and meta mode. In classic mode (left), multiple BAM files, each representing different samples mapped to a reference genome (e.g., human), are processed with bam2prof, a lightweight C++ program included with AdDeam, to generate individual damage profiles. These profiles are then clustered using the GMM algorithm, assigning each sample a probability of belonging to each of the k (in this example k=2) clusters. In meta mode (right), one (or more) BAM file containing the original aDNA fragments aligned to a reference containing the assembled contigs is processed. The bam2prof program generates damage profiles for each reference within the BAM file. GMM clustering (shown for k = 3) then assigns probabilities for each contig’s cluster membership. The bar plots display these stacked membership probabilities by colour. AdDeam then outputs the representative damage profile for each cluster.

Clustering of simulated damage profiles. Upper panel: Ground truth distribution of samples across the three simulated damage levels used as input for AdDeam. The damage plots above the bars illustrate the range of substitution rates that individual samples within each damage level carry. Lower panel left: GMM algorithm clustering results for k=2 and k=3. For k=2, no-damage profiles form one cluster, while mid- and high-damage profiles are combined into a second cluster. For k=3, all profiles are correctly assigned to their respective categories. Lower panel right: PCA of damage profiles for k=3. The plot reveals three distinct clusters, with colour-coding derived from the GMM algorithm.

Clustering results for 49 samples aligned against the genomes of four bacterial species that represent the “core genera” of oral microbiomes. A single damage profile was obtained per sample using the alignments to these four genomes, which are usually present in the oral microbiome. Left: Probability assignments to each cluster (k=4). Right: Representative damage profiles for each cluster. AdDeam was run in “classic mode”, generating a single damage profile for all four bacterial species.

Classifying contigs based on their damage profiles from co-assembled fragments of modern and ancient sources. Three samples from Fellows Yates et al. (2021) were merged and assembled using MEGAHIT (Li et al. 2015). Fragments were mapped to contigs that are over 10,000 bp in length, followed by analysis with AdDeam to generate damage reports. The results demonstrate the ability to differentiate between modern and ancient contigs. Clustering analysis (k = 3) was applied to the VLC (21st century), TAF (Later Stone Age), and EMN (Upper Palaeolithic) datasets. The PCA plot is based on damage profiles from contigs over 10,000 bp, with clusters coloured according to the clustering results. The legend next to each cluster shows the representative damage profile (in order) and the proportion of sample-specific fragments aligning to the contigs in each cluster (Modern: VLC, Ancient (LSA): TAF, Ancient (UP): TAF).

Funding2

—Novo Nordisk Data Science Investigator
—Department for Health Technology at DTU

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsForensic and Genetic Research · Yersinia bacterium, plague, ectoparasites research · Bacillus and Francisella bacterial research

Full text

1 Introduction

Validating the authenticity of ancient DNA (aDNA) samples is essential to ensure the reliability of scientific conclusions in evolutionary and archaeological studies (Green et al. 2009, Der Sarkissian et al. 2021). The authentication process relies on the characteristic molecular alterations that occur in DNA molecules. Two primary processes lead to the breakdown and chemical alteration of aDNA: hydrolytic mechanisms cause the degradation of DNA molecules, thus resulting in shorter fragments (Hofreiter et al. 2001b, Pääbo 1989), and deamination converts cytosines into uracil molecules. During DNA sequencing, these alterations are interpreted as base substitutions (Briggs et al. 2007). This phenomenon, known as “damage,” occurs predominantly at the ends of aDNA fragments. In double-stranded aDNA library preparation (Briggs and Heyn 2011), this leads to elevated frequencies of C $[eqn]$ T substitutions at the 5 $[eqn]$ end and G $[eqn]$ A substitutions at the 3 $[eqn]$ end (Briggs et al. 2007, Dabney et al. 2013). In contrast, for single-stranded aDNA library preparation (Gansauge et al. 2017, Kapp et al. 2021), only C $[eqn]$ T substitutions at the 5 $[eqn]$ end are observable. Base substitutions can be partially or nearly completely removed through wet-lab protocols using uracil-DNA glycosylase (UDG) treatment. Full UDG treatment excises uracil residues from aDNA fragments, effectively eliminating damage signatures. While this improves sequence accuracy, it also results in shorter fragment lengths (Briggs et al. 2010). Partial UDG treatment leaves some uracil residues intact, preserving a limited number of damage sites for authentication (Rohland et al. 2015). As a result, full UDG treatment produces fragments with no observable damage patterns, whereas partial UDG treatment retains minimal damage signatures. It should be noted that, in certain samples such as those from vertebrates, cytosines in methylated CpG dinucleotides undergo direct deamination to form thymine. Therefore, UDG treatment is ineffective at these sites, resulting in residual C $[eqn]$ T substitution frequencies even after full treatment (Briggs et al. 2010).

Damage patterns are used to determine whether the DNA being extracted from fossils or ancient environments is indeed ancient in origin (Hofreiter et al. 2001a, Briggs et al. 2007, Hofreiter et al. 2001b). DNA from modern contaminants will not display such damage patterns (Richards et al. 1995, Sampietro et al. 2006, Krause et al. 2010, Handt et al. 1996, Warinner et al. 2017, Key et al. 2017, Renaud et al. 2015). In addition, the levels of damage are informative about the conditions in which DNA was preserved, as C $[eqn]$ T substitutions towards the 5'-ends of aDNA molecules tend to increase over time (Sawyer et al. 2012). When several samples of eukaryotic species like humans are sequenced, researchers routinely screen multiple extractions and determine which carry damage patterns and at which level to determine authenticity (Pandey et al. 2024, Allentoft et al. 2024, Kjær et al. 2022). This involves aligning the aDNA fragments to a single reference from the eukaryotic species in question. In the case of metagenomic studies involving de novo assembly, damage patterns are used to determine which contigs are likely to be ancient by aligning the aDNA fragments back to the assembly (Hodgins et al. 2023, Rozwalak et al. 2024, Klapper et al. 2023, Brealey et al. 2020, Wibowo et al. 2021). A contig where the aligned fragments do not display damage patterns is more likely to be a modern contaminant (Borry et al. 2021). Furthermore, aDNA fragments from microbes can be aligned to multiple reference genomes, and damage patterns are used to determine which species are likely to be present and bona fine aDNA (Fellows Yates et al. 2021, Fernandez-Guerra et al. 2023, Bozzi et al. 2024, Spyrou et al. 2019). Regardless of which, researchers in aDNA are faced with the task of inspecting large numbers of damage pattern plots to identify outliers.

Several bioinformatic tools generate damage profiles from alignment files, but none are designed to cluster hundreds to thousands of profiles at once to reveal shared damage signatures. Manually inspecting individual damage plots is time-consuming and prone to oversight, and rapid clustering of similar patterns could immediately flag contamination, preservation states, or library-type effects (Al-Asadi et al. 2019, Slimak et al. 2024). Moreover, in ancient metagenomic studies involving de novo assembly, clustering contig damage profiles can streamline authentication workflows. For example, clusters of contigs that exhibit no damage can serve as calibration sets for statistical tools like PyDamage (Borry et al. 2021). Visually accessible clusters can inform model parameter selection, such as setting thresholds to distinguish ancient from non-ancient contigs, while clusters with ambiguous signatures flag those contigs for targeted inspection.

Existing tools fall into two broad categories. First, damage profilers such as mapDamage (Ginolhac et al. 2011, Jónsson et al. 2013) and PMDtools (Skoglund et al. 2014) compute damage profiles from a single reference genome, while DamageProfiler (Neukamm et al. 2021) can additionally distinguish multiple references within a single file. Second, metagenomic‐focused tools and pipelines such as pyDamage (Borry et al. 2021), metaDamage (Everett and Cribdon 2023), metaDMG (Michelsen et al. 2022), HOPS (Hübler et al. 2019), and aMeta (Pochon et al. 2023) quantify damage at the contig or taxonomic level, often integrating mapDamage, PMDtools, and alignment tools like MALT (Herbig et al. 2016, Vågene et al. 2018) into automated workflows.

To our knowledge, only aRchaic (Al-Asadi et al. 2019) attempts to cluster fragments by damage patterns, but it is limited to a single predefined reference (Williams et al. 2025) and is no longer actively maintained. This methodological gap motivated the development of AdDeam, which combines (i) efficient computation of damage profiles from one or many BAM files with support for paired‐end, single‐strand, or mixed libraries and (ii) a Gaussian mixture‐model clustering framework that groups similar profiles into k clusters. AdDeam then visualizes both individual and representative damage patterns, enabling rapid, multisample screening of damage patterns across diverse references.

Two modes are available: In classic mode, AdDeam analyses BAM files from different samples aligned to the same reference genome, useful for assessing damage in fragments aligned to a human reference. In meta mode, it evaluates fragments from multiple contigs within a single BAM file. This mode is aimed at researchers who wish to determine which contigs are potentially modern in origin.

Utilizing a Gaussian Mixture Model (GMM), AdDeam clusters similar damage profiles, providing both quantitative assessments and clear visualizations. This approach is ideal for organizing aDNA studies with multiple samples or datasets spanning numerous species (Kjær et al. 2022, Fernandez-Guerra et al. 2023, Fellows Yates et al. 2021, Margaryan et al. 2020, Zavala et al. 2021). AdDeam processes large collections of BAM files as input from any reference type (e.g., bacterial, human, or other genomes) as well as BAM files containing alignments to multiple references, such as contigs or grouped reference genomes. It outputs the degree of assignment of the samples or contigs to each cluster and reports the representative damage profile for each cluster.

We demonstrated AdDeam’s ability to accurately cluster damage profiles using simulated data and tested its performance on empirical datasets.

2 Materials and methods

In this manuscript, we distinguish between fragments and reads as follows: fragments represent the sequences of degraded aDNA molecules. In paired-end sequencing, each fragment is sequenced from both ends, producing two corresponding reads. Reads are individual observations of the same fragment. However, by trimming adapters and merging paired-end reads, one can reconstruct the most likely sequence representing the original fragment (Lien et al. 2023). Consequently, we consistently refer to aDNA fragments when discussing trimmed and merged sequences, such as those used for alignment to a reference.

AdDeam consists of two subsequent stages: (i) generating damage profiles from BAM files either for each sample in classic mode or per contig in meta mode (ii) clustering these based on their similarity into k clusters, where k is user-defined. By default, clustering is performed for $[eqn]$ and 4. Finally, AdDeam outputs a comprehensive report in PDF format as well as tab-separated files for convenient downstream analysis. The workflow of AdDeam is illustrated in Fig. 1.

Overview of the two modes of the AdDeam workflow: classic mode and meta mode. In classic mode (left), multiple BAM files, each representing different samples mapped to a reference genome (e.g., human), are processed with bam2prof, a lightweight C++ program included with AdDeam, to generate individual damage profiles. These profiles are then clustered using the GMM algorithm, assigning each sample a probability of belonging to each of the k (in this example k=2) clusters. In meta mode (right), one (or more) BAM file containing the original aDNA fragments aligned to a reference containing the assembled contigs is processed. The bam2prof program generates damage profiles for each reference within the BAM file. GMM clustering (shown for k = 3) then assigns probabilities for each contig’s cluster membership. The bar plots display these stacked membership probabilities by colour. AdDeam then outputs the representative damage profile for each cluster.

2.1 Stage 1: damage profile generation

We employ bam2prof, a tool written in C++, originally developed by Gabriel Renaud and Thorfinn Korneliussen (https://github.com/grenaud/bam2prof), to generate damage profiles from BAM files containing aDNA alignments. bam2prof iterates over all alignments and calculates the ratio of observed C $[eqn]$ T and G $[eqn]$ A mismatches per position. This stage is independent of the clustering stage and its sole aim is to compute damage patterns. Building upon the original implementation, we have extended the functionality with several key features:

2.1.1 Classic mode

Generates a single damage profile per BAM file. This function may be used to analyse a set of BAM files that are aligned to a single reference genome, regardless of whether the reference genome is of bacterial or eukaryotic origin (i.e., human genome with multiple chromosomes), as illustrated in Fig. 1, CLASSIC MODE panel.

2.1.2 Meta mode

Produces individual damage profiles for each reference sequence within a BAM file, enabling analysis of damage patterns with each record (be it a contig, scaffold, or chromosome) in a reference. This mode is particularly useful for samples where aDNA fragments have been aligned to a database containing multiple reference genomes, such as prokaryotic genomes in metagenomic studies. The result is one damage profile per record in the reference. Additionally, this mode is especially interesting for metagenome assemblies, allowing fragments used for assembly to be aligned back to the resulting contigs, producing a damage profile for each contig (see Fig. 1, META MODE panel).

2.1.3 Fast profile generation from large BAM files

BAM files can be very large, which makes computing damage profiles time-intensive. To address this, bam2prof implements an iterative convergence check that both accelerates processing and detects unstable (“zig‐zag”) profiles. After every n aligned fragments, bam2prof measures the absolute difference in substitution frequencies (e.g. C $[eqn]$ T) and terminates early if this difference falls below a user-defined threshold d. Users can adjust n and d (via -numAligned and -precision), although the default behaviour ensures processing of all aligned fragments.

This same convergence criterion identifies atypical profiles that are common in ancient metagenomic studies due to low coverage or mismappings. Such non-convergent profiles are written to a separate “_notConverged” directory for downstream analysis. A detailed analysis of convergence behaviour (tracking C $[eqn]$ T changes every 500 aligned fragments) and its impact on damage profile stability is provided in Supplementary Material (Section 3, available as supplementary data at Bioinformatics online).

2.1.4 Python wrapper for parallel processing of multiple input files

To facilitate efficient high-throughput analysis, we developed a Python wrapper around bam2prof that enables the parallel processing of multiple BAM files across multiple threads. This wrapper can be invoked with the command:addeam-bam2prof.py -i inputBams.txt -o outputDirwhere inputBams.txt is a file listing the input BAM files, and outputDir specifies the directory for outputting the generated damage profiles. All parameters available for the compiled bam2prof program are also supported by the wrapper script and passed on accordingly.

2.2 Stage 2: clustering of damage profiles into k clusters

The damage profiles generated by bam2prof for a single sample are stored as two tab-delimited text files: one representing the substitution rates at the 5' end of the DNA fragments and the other for the 3' end. In these files, rows correspond to the first nucleotide positions from each end (5' or 3'), and columns capture substitution rates for all 12 possible mismatches at each position.

We prepare the profiles for clustering by combining the damage rates. bam2prof considers only 5 positions away from the 5' and 3' end, which get converted into a feature vector of length p and therefore, by default, $[eqn]$ . Specifically, the first five elements of this vector capture the C $[eqn]$ T substitution frequencies at positions 1 through 5 from the 5' end, whereas the last five elements represent G $[eqn]$ A substitution frequencies at positions 5 through 1 from the 3' end. This approach focuses on substitution patterns that are informative for identifying aDNA, condensing each profile into a single line vector where one line represents a sample (or contig in meta mode). To improve separation during clustering, all values in the vector are transformed using the logarithm of the substitution frequencies. The logarithm scale is useful as a jump from 0% damage to 15% should be more noticeable than a 15% to 30% jump in damage rates. The logarithm space makes smaller values (which are likely to be a contaminant) more salient.

By concatenating the vectors from all profiles, we construct an $[eqn]$ matrix X, where each row represents one of m samples and each column corresponds to one of the ten informative positions p (five for C $[eqn]$ T and five for G $[eqn]$ A). This matrix serves as input for clustering the damage patterns using a GMM algorithm implemented using scikit-learn (Pedregosa et al. 2011). Additionally, AdDeam provides a -lib flag to accommodate different library types in a single run: -lib paired uses the C $[eqn]$ T column from the 5 $[eqn]$ matrix and the G $[eqn]$ A column from the 3 $[eqn]$ matrix; -lib single uses both C $[eqn]$ T columns (5 $[eqn]$ and 3 $[eqn]$ ); and -lib mixed uses all four columns (C $[eqn]$ T and G $[eqn]$ A from both ends) as the input vector for clustering.

Unlike deterministic clustering algorithms such as k-means, which assign each sample to a single cluster, the GMM algorithm assigns probabilities to each sample of belonging to every cluster, reflecting the model’s uncertainty in classification. This probabilistic approach allows samples to have partial membership across clusters, capturing cases where certain samples do not clearly belong to a single cluster. These ambiguous samples can then be further analysed downstream. However, the number of clusters, k, must be specified by the user in advance, as it is an inherent parameter of the GMM algorithm.

The GMM algorithm is a probabilistic model that assumes data is generated from a mixture of k Gaussian distributions. Each cluster is represented as a Gaussian distribution with its own mean and covariance structure, which together define the shape and orientation of the cluster. In our application, the number of Gaussian components, k, is specified by the user, with default values of $[eqn]$ , and 4. For each chosen k, the GMM estimates the parameters of each Gaussian distribution to best fit the observed data, which in this case are the damage profiles.

The likelihood of observing a sample $[eqn]$ under the GMM is defined as

[eqn]

where $[eqn]$ are the mixing coefficients, $[eqn]$ are the mean vectors, and $[eqn]$ are the covariance matrices of the Gaussian components. Here, $[eqn]$ represents the probability density function of the multivariate Gaussian distribution for component j, parametrised by its mean vector $[eqn]$ and covariance matrix $[eqn]$ . Each Gaussian component represents a potential cluster in the data, with its own probability distribution. Instead of using $[eqn]$ directly as the damage profile, we compute a weighted representative pattern for each cluster by averaging the damage patterns of all samples, weighted by their cluster probabilities. This allows researchers to scan k damage profiles rather than hundreds or thousands.

During the development of AdDeam, we explored different covariance structures for the GMM algorithm and found that a spherical covariance structure worked best. This structure assumes that each cluster’s covariance matrix is isotropic, meaning that clusters have the same spread in all directions, forming spherical shapes in multi-dimensional space. An advantage of this approach is its computational efficiency, as it requires estimating fewer parameters compared to more flexible structures like a full covariance matrix. The influence of different covariance matrices on clustering outcomes and their applicability to damage patterns is elaborated in the Supplementary Material, Section 2.3, available as supplementary data at Bioinformatics online. The model is optimized using the Expectation-Maximization algorithm, which iteratively maximizes the likelihood of the observed data under the model, assigning each sample a probability-based cluster membership.

GMM model fitting begins by initializing the component means, which can be chosen at random or set manually. In AdDeam, we initialize these means systematically. Here, “values” refers to the minima and maxima of the input matrix X:

For $[eqn]$ , we set $[eqn]$ to the minimum values and $[eqn]$ to the arithmetic mean of minimum and maximum values in linear space.For $[eqn]$ , we add a third mean, $[eqn]$ , set to the maximum values.For $[eqn]$ , additional means are distributed evenly between minimum and maximum values using linear interpolation.

When working with low‐coverage metagenomic data, damage profiles can be less reliable if generated from references with minimally aligned fragments (Everett and Cribdon 2023). Hence, the Python script that runs the clustering first prefilters damage profiles based on the number of assigned fragments, ensuring that only reliable profiles are included. Erroneous mappings or low‐read evidence can reduce profile accuracy and lead to incorrect clustering outcomes. To address this, profiles generated from references with fewer than a specified threshold of aligned fragments (default: 1,000) are excluded from clustering. Importantly, bam2prof also flags nonconvergent (“zig‐zag”) profiles and writes them to a separate “_notConverged” directory. These unstable profiles are not passed to the clustering algorithm unless the user explicitly points the script at their directory.

2.2.1 Output and visualisation

After clustering the damage profiles, each sample is assigned probabilities indicating its membership in each of the k clusters. However, some samples may exhibit nearly uniform probability distributions across multiple clusters, making their classification ambiguous.

To aid interpretation and identify potential outliers, we include a visualization step that projects the clustered damage profiles into a lower-dimensional space using Principal Component Analysis (PCA). Analogous to clustering in logarithmic space, we also perform the PCA on the logarithm of the substitution frequencies. The two most significant principal components are plotted, with each sample colour-coded according to its highest GMM cluster assignment. This provides an intuitive way to assess cluster separation and detect samples with uncertain assignments.

Additionally, we generate probability plots illustrating each sample’s cluster membership probabilities and compute representative damage profiles for each identified cluster as weighted averages based on GMM assignment probabilities.

For downstream analysis, all results, including sample identifiers, mapped read counts, cluster assignment probabilities, and Euclidean distances to GMM cluster centers in PCA space, are exported as tab-separated files.

3 Results

We focus on this simulation to demonstrate the feasibility of our clustering approach. As there are no existing tools that appear to cluster damage parameters in such a way, direct benchmarking is difficult. Most available methods either generate damage profiles or classify the probability that, for example, a contig is ancient (Ginolhac et al. 2011, Jónsson et al. 2013, Skoglund et al. 2014, Neukamm et al. 2021, Everett and Cribdon 2023, Pochon et al. 2023, Hübler et al. 2019, Michelsen et al. 2022, Borry et al. 2021), but they do not cluster damage profiles, which is one of the core functionalities of AdDeam. While aRchaic (Al-Asadi et al. 2019) could also perform clustering, it does so under the assumption of a single predefined reference, without distinguishing between multiple references within a BAM file. As a result, it cannot capture damage variation across multiple references, which is crucial in metagenomic studies. Additionally, aRchaic is no longer maintained, and despite extensive efforts, we were unable to install it successfully. We nevertheless include in the Supplementary Material (Section 2, available as supplementary data at Bioinformatics online) two comparative analyses: one benchmarking AdDeam against the original aRchaic results from Al-Asadi et al. (2019), and another comparing AdDeam clusters to PyDamage (Borry et al. 2021) classifications.

3.1 Clustering of simulated damage profiles

The damage patterns used for simulating ancient reads served as the ground truth, exhibiting distinct substitution rate intervals for no damage, mid-damage, and high damage (see details in Supplementary Material, available as supplementary data at Bioinformatics online). However, the estimated damage profiles generated by bam2prof inevitably deviate from this ground truth due to alignment biases. These biases arise because not all fragments are perfectly mapped to the reference, leading to overlapping damage rate intervals after estimation. This effect is further detailed in Supplementary Material, Section “Simulated Datasets: Additional Information,” and Fig. 1, available as supplementary data at Bioinformatics online.

In order to evaluate AdDeam’s ability to correctly cluster similar damage patterns, we used simulated datasets in which the ground truth damage levels were known to us but not to the algorithm. After running AdDeam’s clustering module, we compared its output with the ground truth. From a simulated dataset of 890 samples, we selected a subset comprising 290 samples with no damage, 145 samples with mid damage, and 29 samples with high damage to reflect the empirical heterogeneity of damage patterns. In total, 464 samples were analysed.

Figure 2, upper panel “Ground Truth of Simulation” illustrates the distribution of samples across the three damage levels that were simulated and subsequently used as input for AdDeam.

Clustering of simulated damage profiles. Upper panel: Ground truth distribution of samples across the three simulated damage levels used as input for AdDeam. The damage plots above the bars illustrate the range of substitution rates that individual samples within each damage level carry. Lower panel left: GMM algorithm clustering results for k=2 and k=3. For k=2, no-damage profiles form one cluster, while mid- and high-damage profiles are combined into a second cluster. For k=3, all profiles are correctly assigned to their respective categories. Lower panel right: PCA of damage profiles for k=3. The plot reveals three distinct clusters, with colour-coding derived from the GMM algorithm.

On the left side of Fig. 2, “Output 1: Gaussian Mixture Model Cluster Assignment,” the automatically generated clustering results for $[eqn]$ and $[eqn]$ are shown.

For $[eqn]$ , AdDeam grouped all no-damage profiles into a single cluster, while mid- and high-damage profiles were combined into a second cluster. Representative damage profiles for each cluster are also provided. The representative profile for the no-damage cluster closely aligns with the expected near-zero substitution frequencies. The representative profile for the combined mid- and high-damage cluster is more similar to the mid-damage profile. This results from the cluster being dominated by 145 mid-damage profiles, compared to only 29 high-damage profiles, making the representative profile weighted accordingly.

For $[eqn]$ , AdDeam successfully assigned all profiles to their correct categories. The representative profiles for the three clusters were distinct and aligned with the expected patterns of no-damage, mid-damage, and high-damage.

On the right side of Fig. 2, “Output 2: Principal Component Analysis of Damage Profiles in Regard to Output 1” presents a PCA plot of the clustering results for $[eqn]$ . The plot reveals three distinct clusters, visually apparent from the data point distributions and supported by colour coding from the GMM algorithm. No-damage, mid-damage, and high-damage profiles each form separate clusters. The uniform colour within each group indicates perfect consistency in the classification since each cluster only contains samples of the same simulated ground truth of damage. This colour-coded PCA plot is automatically generated by AdDeam and provided to the user.

3.2 Empirical analysis: mapping ancient metagenomic fragments to core oral microbiome genera

In order to evaluate AdDeam’s ability to cluster damage patterns in empirical datasets, we analysed fragments from 49 ancient metagenome datasets originally presented by Fellows Yates et al. (2021). These datasets were selected because key characteristics such as sample age and UDG treatment were known, providing metrics to validate AdDeam’s clustering performance. The samples, which were extracted from dental calculus, were obtained from various species, including howler monkeys, gorillas, chimpanzees, Neanderthals, ancient modern humans, and present-day modern humans.

We mapped the 49 samples against the reference genomes of four bacterial species: Tannerella forsythia, Porphyromonas gingivalis, Treponema denticola, and Fretibacterium fastidiosum. These species were chosen because they were described as “core genera” in the study by Fellows Yates et al. (2021), which defines a core genus as one found in at least two-thirds of the populations within a given host genus. These bacteria are highly prevalent across all host genera in the source study, including Alouatta (Howler monkey), Gorilla, Pan troglodytes (Chimpanzee), and Homo Sapiens. In modern humans, they are strongly associated with periodontal disease (Fellows Yates et al. 2021), and analysing their ancient genomes provides valuable insights into the evolution of the oral microbiome and its relationship to host health.

Therefore, the damage profiles for these four core genera species are of particular interest: their presence can be confidently inferred if ancient damage patterns are observed, and the effectiveness of UDG treatment can be assessed, ensuring the removal of base substitutions crucial for downstream analyses.

The BAM files containing the alignments of the 49 samples were analysed using AdDeam in “classic mode,” which means that a single damage profile was reported for the four “core genera”. Figure 3 shows the clustering results for $[eqn]$ and the metadata associated with the samples, including cluster assignment, sample source, sample age, and UDG treatment. We saw that $[eqn]$ provided a clear separation of the different samples. The left panel displays the probability assignment of each sample to a cluster. The uniform colour of each bar indicates that all samples were assigned to a cluster with probabilities close to 100%. The right panel shows the representative damage patterns for each cluster.

Clustering results for 49 samples aligned against the genomes of four bacterial species that represent the “core genera” of oral microbiomes. A single damage profile was obtained per sample using the alignments to these four genomes, which are usually present in the oral microbiome. Left: Probability assignments to each cluster (k=4). Right: Representative damage profiles for each cluster. AdDeam was run in “classic mode”, generating a single damage profile for all four bacterial species.

Clusters 1 and 2 exhibit representative damage profiles close to zero (cluster 1) and with minimal damage (maximum 5%) for cluster 2. In contrast, the representative profiles for clusters 3 and 4 display more pronounced damage patterns. Cluster 3 shows damage around 30% at the first position, while cluster 4 shows high damage with over 40% at the first position.

To provide additional context, Fig. 3 also includes metadata of the samples such as the sample source, the period they originate from, and whether they were UDG-treated or not. A Table that also includes the sample codes from Fellows Yates et al. (2021) can be found in the Supplementary Material, available as supplementary data at Bioinformatics online.

Figure 3 reveals that cluster 1 primarily contains all UDG-treated samples from all time periods, with a single exception. This exception is a sample from a howler monkey (Alouatta) dated to the 20th-21st century. The damage profile of this outlier shows exceptionally low substitution frequencies, consistent with the UDG-treated samples. This explains its assignment to cluster 1. The output files of AdDeam, including the damage profiles of all samples, are accessible at https://github.com/LouisPwr/AdDeamAnalysis.

Cluster 2 consists exclusively of non-UDG-treated samples from relatively modern time periods (20th-21st century). In contrast, clusters 3 and 4 only include samples from older periods, specifically the Mesolithic, Upper Paleolithic, Middle Paleolithic, and Later Stone Age.

3.3 Empirical analysis: clustering damage profiles from assembled contigs

As described in the Introduction AdDeam can generate damage profiles for individual records (i.e. contigs, scaffolds, or chromosomes) in a reference genome using bam2prof. This feature enables an interesting use case: clustering damage profiles derived from multiple reference sequences, such as those obtained through de novo assembly of ancient metagenomes. The fragments can be realigned to the contigs and determined which are potential contaminants.

To explore this, we analysed a co-assembly strategy by combining one modern sample (Sample ID VLC) and two ancient samples (Sample IDs TAF and EMN) all from Fellows Yates et al. (2021). The combined dataset was assembled using MEGAHIT (Li et al. 2015) and the resulting contigs were filtered to have lengths of at least 10,000 bp. The original fragments, which were used for assembly, were then mapped to these contigs, and the alignment files were analysed using AdDeam. The ENA accession numbers included in the FASTQ headers of each sample enabled us to trace which fragments originated from each sample. Therefore, we could determine for each cluster the fraction of sample-specific fragments that aligned to the contigs. This helped us infer the composition of the contigs, showing whether they were mostly assembled from modern or ancient fragments.

The results of this analysis are shown in Fig. 4. The left panel displays a PCA plot, where contigs are clustered into three groups ( $[eqn]$ ) based on their damage profiles. To validate whether the contigs clustered together match the composition of the reads from which they were assembled, we counted the fragments aligning to contigs in each cluster and calculated the fraction coming from each sample (VLC for modern, TAF for Later Stone Age, and EMN for Upper Paleolithic). The right panel of Fig. 4 shows that contigs in cluster 1 have mapped almost exclusively modern fragments ( $[eqn]$ ), contigs in cluster 3 have mapped almost exclusively ancient fragments, and contigs in cluster 2 represent a mix of both. These results align with the representative damage patterns shown.

Classifying contigs based on their damage profiles from co-assembled fragments of modern and ancient sources. Three samples from Fellows Yates et al. (2021) were merged and assembled using MEGAHIT (Li et al. 2015). Fragments were mapped to contigs that are over 10,000 bp in length, followed by analysis with AdDeam to generate damage reports. The results demonstrate the ability to differentiate between modern and ancient contigs. Clustering analysis (k = 3) was applied to the VLC (21st century), TAF (Later Stone Age), and EMN (Upper Palaeolithic) datasets. The PCA plot is based on damage profiles from contigs over 10,000 bp, with clusters coloured according to the clustering results. The legend next to each cluster shows the representative damage profile (in order) and the proportion of sample-specific fragments aligning to the contigs in each cluster (Modern: VLC, Ancient (LSA): TAF, Ancient (UP): TAF).

The PCA plot in Fig. 4 also shows that some contigs between clusters 1 and 2 are not clearly assigned to a single cluster by the GMM. These contigs are represented by lighter gradient colours, representing lower probabilities of cluster membership. This uncertainty in cluster assignment is consistent with their intermediate positions in the PCA plot, where these data points visually appear between clusters 1 and 2.

Execution of both AdDeam commands completed in under 2.5 minutes and required less than 400 MB of memory, as detailed in the Supplementary Material, available as supplementary data at Bioinformatics online.

4 Discussion

We developed AdDeam, a versatile and user-friendly tool for estimating DNA damage and clustering damage profiles. Designed to streamline the analysis and comparison of damage patterns, AdDeam fills a methodological gap when working with large datasets, such as those containing numerous samples or alignment files with multiple references. As publicly accessible aDNA datasets continue to grow, including million-year-old specimens with highly variable damage patterns (Kjær et al. 2022, van der Valk et al. 2021), the need for efficient and versatile screening tools becomes ever more critical.

Because the GMM requires the user to specify the number of clusters, k, it is important to consider what each choice represents. Intuitively, $[eqn]$ aligns with the fundamental question in aDNA research of damaged versus undamaged DNA molecules (Orlando et al. 2021). However, we have shown that the values of $[eqn]$ offer benefits as well. For example, $[eqn]$ streamlines quality control by partitioning modern (negative control), non-UDG, and UDG-treated libraries (Supplementary Material, Section 2, available as supplementary data at Bioinformatics online). De novo assembled contigs in ancient metagenomic studies can exhibit a broad spectrum of damage. In such settings, intermediate clusters (e.g., $[eqn]$ in Fig. 4) can capture contigs with ambiguous profiles, flagging them for further investigation. Therefore, clustering with $[eqn]$ can guide authentication workflows, for example, by using low‐damage clusters as a visual baseline for adjusting significance thresholds of tools like PyDamage (Borry et al. 2021) (see Supplementary Material Section 2.2, available as supplementary data at Bioinformatics online).

Future work on clustering damage profiles should further explore algorithms that account for positional dependencies. Our default spherical‐covariance GMM assumes independence between read‐position features, neglecting true sequential dependencies. However, in our benchmarks, GMMs with full covariance matrices performed poorly, whereas DTW‐based (Dynamic Time Warping) clustering performed similar to the spherical GMM but still failed to outperform it consistently (Supplementary Material, Section 2.3, available as supplementary data at Bioinformatics online). Investigating further methods to integrate positional dependencies could enhance clustering of complex datasets (e.g., mapped fragments to thousands of de novo assembled contigs) by revealing additional patterns and enabling more fine-grained clusters.

During the development of AdDeam, we considered whether clustering damage profiles might reveal preservation or depositional “fingerprints” beyond simple authentication. While Sawyer et al. (2012) noted that variable and often unknown preservation conditions make it unlikely for DNA damage to correlate directly with sample age, no large‐scale study has yet tested whether damage‐profile signatures are associated with geographic or environmental factors. By enabling rapid clustering, visualization, and versatile application to diverse datasets, AdDeam lays the groundwork for future explorations of the ecological and spatial dimensions of aDNA preservation.

Supplementary Material

btaf407_Supplementary_Data

Bibliography53

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Al-Asadi H , Dey KK, Novembre J et al Inference and visualization of DNA damage patterns using a grade of membership model. Bioinformatics 2019;35:1292–8. 10.1093/bioinformatics/bty 77930192911 PMC 6821257 · doi ↗ · pubmed ↗
2Allentoft ME , Sikora M, Fischer A et al 100 ancient genomes show repeated population turnovers in neolithic Denmark. Nature 2024;625:329–37. 10.1038/s 41586-023-06862-338200294 PMC 10781617 · doi ↗ · pubmed ↗
3Borry M. , Hübner A., Rohrlach A. B., Warinner C. Py Damage: automated ancient damage identification and estimation for contigs in ancient DNA de novo assembly. Peer J 2021;9:e 11845. 10.7717/peerj.1184534395085 PMC 8323603 · doi ↗ · pubmed ↗
4Bozzi D , Neuenschwander S, Cruz Dávalos DI et al Towards predicting the geographical origin of ancient samples with metagenomic data. Sci Rep 2024;14:21794. 10.1038/s 41598-023-40246-x 39294129 PMC 11411106 · doi ↗ · pubmed ↗
5Brealey JC , Leitão HG, van der Valk T et al Dental calculus as a tool to study the evolution of the mammalian oral microbiome. Mol Biol Evol 2020;37:3003–22. 10.1093/molbev/msaa 13532467975 PMC 7530607 · doi ↗ · pubmed ↗
6Briggs AW , Heyn P. Preparation of next-generation sequencing libraries from damaged DNA. In: Shapiro B, Hofreiter M. (eds), Ancient DNA. Methods in Molecular Biology, Vol. 840. New York: Humana Press, 2012. 10.1007/978-1-61779-516-9_1822237532 · doi ↗ · pubmed ↗
7Briggs AW , Stenzel U, Johnson PLF et al Patterns of damage in genomic DNA sequences from a neandertal. Proc Natl Acad Sci USA 2007;104:14616–21. 10.1073/pnas.070466510417715061 PMC 1976210 · doi ↗ · pubmed ↗
8Briggs AW , Stenzel U, Meyer M et al Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA. Nucleic Acids Res 2010;38:e 87. 10.1093/nar/gkp 116320028723 PMC 2847228 · doi ↗ · pubmed ↗