AmalgaMo: flexible DNA motif merging

Orsolya Lapohos; Gregory J Fonseca

PMC · DOI: 10.1093/bioadv/vbag043 · February 11, 2026

Open Access

TL;DR

AmalgaMo is a new tool that merges similar DNA motifs to improve the accuracy of predicting upstream regulators in genomic data.

Contribution

AmalgaMo introduces a novel motif merging algorithm optimized for regression-based motif enrichment analysis.

Findings

1. Merging motifs with AmalgaMo improves regression-based motif enrichment analysis.

2. AmalgaMo is an efficient and flexible command-line tool for motif merging.

3. The tool is supported by detailed documentation for genomic data interpretation.

Abstract

Inference of candidate upstream regulators via motif enrichment analysis is a common step in the interpretation of genomic data. However, redundancy in motif databases can negatively impact predictive value, especially when relying on regression-based motif enrichment analysis. Although various forms of motif clustering have been used to mitigate problems caused by redundancy, an algorithm optimized for downstream regression-based analysis is needed. We introduce AmalgaMo, an efficient and flexible command-line tool for merging highly similar motifs. Using publicly available human datasets, we demonstrate that merging motifs with our optimized settings greatly benefits regression-based motif enrichment analysis and provide detailed documentation that can serve as a reference for researchers inferring upstream regulators from genomic data. AmalgaMo is available on GitHub at…

Figures

Figure 1. Example of AmalgaMo-merged motifs with relatively strict parameters. These motifs are qualitatively similar (grouped into two clusters in the HOCOMOCO clustered motif set), but they differ when considering core positional information content. With r=0, AmalgaMo places more emphasis on the information content within the core, allowing for the separation of motifs with such subtle differences.

Figure 2. Evaluation of motifs selected via monaLisa Lasso stability selection using a paired RNA-seq and ATAC-seq dataset from Pahl et al. (2024). The union of all TFs selected via the three motif sets being compared, sorted by magnitude of log2 fold change (FC) in gene expression. Black or white dots (left column set) indicate selection. Red and blue dots (center column set) indicate the log2 FC of the corresponding TF, with stars marking DE TFs (|log2 FC| > 1 and false discovery rate < 0.05). Yellow and orange dots (right column set) indicate gene expression (log10(FPKM + 1)), with diamonds marking detection (mean FPKM ≥ 0.5 in at least one condition). FPKM: fragments per kilobase million.

Tables

Table 1. Evaluation of TFs selected by monaLisa using different motif sets derived from HOCOMOCO, for three datasets.

Dataset (DE TFs)	Motif collection	Selected TFs: total	Selected TFs: differentially expressed	Selected TFs: not detected
Pahl et al. (91)	HOCOMOCO original	32	10 (31.3%)	7 (21.9%)
	HOCOMOCO clustered	97	19 (19.6%)	48 (49.5%)
	universalmotif merged	36	10 (27.8%)	15 (41.7%)
	AmalgaMo merged	79	20 (25.3%)	29 (36.7%)
	AmalgaMo representative	86	17 (19.8%)	45 (52.3%)
Li et al. (220)	HOCOMOCO original	31	15 (48.4%)	13 (41.9%)
	HOCOMOCO clustered	203	71 (35.0%)	133 (65.5%)
	universalmotif merged	37	19 (51.4%)	17 (45.9%)
	AmalgaMo merged	200	73 (36.5%)	126 (63.0%)
	AmalgaMo representative	90	35 (38.9%)	46 (51.1%)
Watt et al. (129)	HOCOMOCO original	16	5 (31.3%)	9 (56.3%)
	HOCOMOCO clustered	71	13 (18.3%)	42 (59.2%)
	universalmotif merged	21	4 (19.0%)	13 (61.9%)
	AmalgaMo merged	61	12 (19.7%)	41 (67.2%)
	AmalgaMo representative	71	14 (19.7%)	48 (67.6%)

Funding

Natural Sciences and Engineering Research Council of Canada (funder DOI: 10.13039/501100000038)

Taxonomy

Topics: Genomics and Chromatin Dynamics · Genomic variations and chromosomal abnormalities · Gene expression and cancer classification

Full text

1 Introduction

Genomic data analyses often aim to uncover putative transcription factors (TFs) that explain epigenetic differences between two biological conditions. Finding these TFs is an important step in understanding the molecular processes underlying health and disease. One way to identify these TFs is through the analysis of DNA sequences within key genomic regions using motif enrichment methods. Crucially, this approach requires a high-quality non-redundant known TF binding motif set.

Regression-based motif enrichment analysis such as monaLisa Lasso stability selection (Machlab et al. 2022) can provide valuable insight into potential upstream regulators of chromatin states. By modeling changes in epigenomic signals as a function of motif presence with the Lasso penalty, TFs compete with one another in explaining this variability, somewhat mimicking underlying molecular mechanisms. However, if the regressor matrix contains highly collinear motifs, only one will be selected, ignoring candidates with equal potential.
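The collinearity problem described above can be illustrated with a small, self-contained sketch. It uses a plain Lasso solved by cyclic coordinate descent, not monaLisa's randomized Lasso stability selection, and all data are synthetic: `motif_a` and `motif_b` are hypothetical motifs with identical hit profiles, `motif_c` is an unrelated motif, and `signal` stands in for the change in accessibility.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the Lasso proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(columns, target, penalty, sweeps=50):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n = len(target)
    weights = [0.0] * len(columns)
    residual = list(target)  # residual = y - Xw, and w starts at zero
    for _ in range(sweeps):
        for k, col in enumerate(columns):
            norm = sum(v * v for v in col) / n
            # Correlation of column k with the residual, excluding its own contribution
            rho = sum(c * (r + weights[k] * c) for c, r in zip(col, residual)) / n
            new_weight = soft_threshold(rho, penalty) / norm
            delta = new_weight - weights[k]
            residual = [r - delta * c for r, c in zip(residual, col)]
            weights[k] = new_weight
    return weights

random.seed(0)
n_regions = 200
motif_a = [random.gauss(0, 1) for _ in range(n_regions)]
motif_b = list(motif_a)  # redundant motif: same hit profile as motif_a
motif_c = [random.gauss(0, 1) for _ in range(n_regions)]  # unrelated motif
signal = [2.0 * a + random.gauss(0, 0.1) for a in motif_a]

weights = lasso_coordinate_descent([motif_a, motif_b, motif_c], signal, penalty=0.1)
print([round(w, 3) for w in weights])
```

In this toy setting the first of the two redundant columns absorbs the effect and the second is left at (numerically) zero, although both explain the signal equally well. Which column wins depends only on update order, which is the behavior that merging redundant motifs is meant to avoid.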

A seemingly simple solution to this problem is to consolidate highly similar motifs. However, measurement of motif similarity is not trivial, and the optimal level of compression is unknown. Researchers applying motif enrichment analysis expect to obtain a reliable set of candidate upstream regulators that is narrow enough to guide further investigation but broad enough to capture all relevant avenues. Thus, an approach optimized for regression-based enrichment analysis is needed.

Several tools have been developed for grouping similar motifs, such as GMACS (Broin et al. 2015), RSAT matrix-clustering (Castro-Mondragon et al. 2017), GimmeMotifs cluster (Bruse and Heeringen 2018), abc4pwm (Ali et al. 2022), and universalmotif merge_similar (Tremblay 2024). Popular motif databases such as JASPAR (Rauluseviciute et al. 2024) and HOCOMOCO (Vorontsov et al. 2024) also have their own reduced motif collections, generated via hierarchical clustering. Despite the number of available tools, a flexible algorithm optimized for a specific downstream application, such as regression-based motif enrichment analysis, does not exist.

Here, we present AmalgaMo, a command-line tool that takes as input a set of DNA (or RNA) motifs in any common matrix format (HOCOMOCO, JASPAR, MEME, or CisBP) and iteratively merges them together using five different parameters that can be tuned for specific applications. We also demonstrate the optimal settings of AmalgaMo for regression-based DNA motif enrichment analysis using paired bulk RNA-seq and ATAC-seq datasets. Further, we leverage the 17 607 available human TF ChIP-seq experiments in the ChIP-atlas (Zou et al. 2022) to thoroughly document the effects of each parameter and guide users tailoring settings to other downstream applications.

2 Methods

2.1 Motif representation and metrics

AmalgaMo uses the information-theoretic representation of motifs originally described by Schneider et al. (1986). Briefly, at position j of a motif, we denote the Shannon information

[eqn]

where $[eqn]$ and $[eqn]$ is the position-probability matrix (PPM) with rows indexed by i and columns indexed by j. Then, the total information of a motif is $[eqn]$ and we define the total information ratio between a pair of motifs x and y

[eqn]

where $[eqn]$ denotes the total information of motif x. Next, we describe the similarity metric used by AmalgaMo: shared information-weighted cosine similarity. The shared information of two motifs at position j within their alignment region is defined

[eqn]

where $[eqn]$ is padded with background probabilities if there is an overhang. Then, we calculate the cosine similarity at each position along the alignment region and weigh it by the proportion of shared information at that position. Finally, the weighted positional similarities are summed to obtain the scalar similarity score

[eqn]

where $[eqn]$ is the total shared information between motifs x and y along their alignment region and $[eqn]$ .
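As a rough illustration of these metrics, the sketch below computes positional Shannon information and an information-weighted cosine similarity for two already-aligned motifs. Several details are assumptions rather than the paper's definitions: a uniform background (so the maximum is 2 bits per position), no small-sample correction, PPMs stored as a list of columns in A, C, G, T order, and a hypothetical "shared information" equal to the sum of the two positional information values.

```python
import math

def position_information(column):
    """Shannon information (bits) of one PPM column: 2 + sum_i p_i * log2(p_i).
    ASSUMPTION: uniform background, no small-sample correction."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def total_information(ppm):
    """Sum of positional information over all columns of a motif."""
    return sum(position_information(col) for col in ppm)

def cosine(u, v):
    """Cosine similarity between two probability columns."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def weighted_similarity(ppm_x, ppm_y):
    """Per-position cosine similarity, weighted by each position's share of
    the shared information, then summed.
    ASSUMPTION: shared information = I_j(x) + I_j(y); this is a placeholder,
    not the definition used by AmalgaMo."""
    shared = [position_information(cx) + position_information(cy)
              for cx, cy in zip(ppm_x, ppm_y)]
    total_shared = sum(shared)
    if total_shared == 0:
        return 0.0
    return sum((s / total_shared) * cosine(cx, cy)
               for s, cx, cy in zip(shared, ppm_x, ppm_y))

# Hypothetical three-position motifs (columns are A, C, G, T probabilities)
motif_x = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25], [0.01, 0.01, 0.97, 0.01]]
motif_y = [[0.01, 0.97, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25], [0.01, 0.01, 0.01, 0.97]]
print(weighted_similarity(motif_x, motif_x), weighted_similarity(motif_x, motif_y))
```

Because the weights are proportions of the total shared information, uninformative positions (such as the flat middle column here) contribute nothing to the score, and a motif compared with itself scores 1.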

2.2 The AmalgaMo algorithm

To ensure that the merged motifs are generated from high-quality matches and to reduce overall compute time, motif pairs are first filtered using three criteria. First, for any pair of motifs $[eqn]$ , there is a maximum length difference allowance. Second, we set a minimum total information ratio $[eqn]$ to consider merging the pair. Third, we find the number of bases within the high-information bounds of both motifs to determine their respective core lengths. These bounds are defined as the first and last positions with at least 1 bit of Shannon information. Then, we apply a maximum core length difference cutoff.
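The three pre-filters can be sketched as follows. The 1-bit core bound comes from the text; the form of the total information ratio (smaller total divided by larger total) is an assumption, and the function names and the values passed for `t`, `m`, and `r` are hypothetical, chosen only for illustration.

```python
import math

def position_information(column):
    """Shannon information (bits) of one PPM column, assuming a uniform background."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def core_length(ppm, min_bits=1.0):
    """Number of positions spanned by the first and last column with >= min_bits."""
    strong = [j for j, col in enumerate(ppm) if position_information(col) >= min_bits]
    return strong[-1] - strong[0] + 1 if strong else 0

def passes_prefilter(ppm_x, ppm_y, t, m, r):
    """t: minimum total information ratio, m: maximum length difference,
    r: maximum core length difference."""
    if abs(len(ppm_x) - len(ppm_y)) > m:
        return False
    info_x = sum(position_information(c) for c in ppm_x)
    info_y = sum(position_information(c) for c in ppm_y)
    # ASSUMPTION: ratio = smaller total information / larger total information
    if max(info_x, info_y) == 0 or min(info_x, info_y) / max(info_x, info_y) < t:
        return False
    return abs(core_length(ppm_x) - core_length(ppm_y)) <= r

strong = [0.97, 0.01, 0.01, 0.01]   # about 1.76 bits
flat = [0.25, 0.25, 0.25, 0.25]     # 0 bits
motif_x = [flat, strong, strong, strong, flat]      # core length 3
motif_y = [strong, strong, flat, strong, strong]    # core length 5

print(passes_prefilter(motif_x, motif_y, t=0.5, m=2, r=0))  # core lengths differ by 2
print(passes_prefilter(motif_x, motif_y, t=0.5, m=2, r=2))
```

Running the cheap filters before any alignment keeps the number of pairwise similarity computations small, which is the compute-time motivation given in the text.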

Once a candidate set of motif pairs is obtained, we compute their similarities. We find the best alignment between each motif pair, applying a predefined minimum overlap requirement. Then, the similarity score $[eqn]$ is calculated, padding overhangs with background probabilities.
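One way to picture the alignment step is an exhaustive search over ungapped offsets, padding overhangs with background probabilities and keeping the best-scoring offset that satisfies the minimum overlap. The sketch below makes several assumptions: a uniform background, forward-strand comparison only (reverse complements are not handled), and a placeholder scoring function in which the per-position weight is the sum of the two positional information values. `best_alignment` and `pad_alignment` are hypothetical names.

```python
import math

BACKGROUND = [0.25, 0.25, 0.25, 0.25]  # ASSUMPTION: uniform background

def info(column):
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity(cols_x, cols_y):
    """Information-weighted cosine similarity (placeholder weighting, see lead-in)."""
    shared = [info(cx) + info(cy) for cx, cy in zip(cols_x, cols_y)]
    total = sum(shared)
    if total == 0:
        return 0.0
    return sum((s / total) * cosine(cx, cy) for s, cx, cy in zip(shared, cols_x, cols_y))

def pad_alignment(ppm_x, ppm_y, offset):
    """Place ppm_y at `offset` relative to ppm_x; pad overhangs with background."""
    start = min(0, offset)
    end = max(len(ppm_x), offset + len(ppm_y))
    padded_x, padded_y = [], []
    for pos in range(start, end):
        padded_x.append(ppm_x[pos] if 0 <= pos < len(ppm_x) else BACKGROUND)
        padded_y.append(ppm_y[pos - offset] if 0 <= pos - offset < len(ppm_y) else BACKGROUND)
    return padded_x, padded_y

def best_alignment(ppm_x, ppm_y, min_overlap):
    """Return (score, offset) of the best ungapped alignment, or None if no
    offset gives at least min_overlap overlapping positions."""
    best = None
    for offset in range(-(len(ppm_y) - min_overlap), len(ppm_x) - min_overlap + 1):
        overlap = min(len(ppm_x), offset + len(ppm_y)) - max(0, offset)
        if overlap < min_overlap:
            continue
        score = similarity(*pad_alignment(ppm_x, ppm_y, offset))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

A = [0.97, 0.01, 0.01, 0.01]
C = [0.01, 0.97, 0.01, 0.01]
G = [0.01, 0.01, 0.97, 0.01]
T = [0.01, 0.01, 0.01, 0.97]
long_motif, short_motif = [A, C, G, T], [C, G]
print(best_alignment(long_motif, short_motif, min_overlap=2))
```

With this weighting, an informative position that hangs over the edge of the other motif is compared against the background column, so overhangs lower the score instead of being ignored.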

Iterative merging begins with the most similar motif pair. To merge a pair of motifs, their PPMs are simply averaged. If the pair of motifs being merged includes previously merged motifs, each original motif is weighted equally in the final merged PPM. Then, the merged motif is aligned and scored against all other qualifying motifs. This process is repeated until no motif pairs passing the selected similarity score threshold remain.
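The greedy loop described above can be sketched in a few lines. To stay short, this version assumes equal-length, pre-aligned motifs and uses the mean column-wise cosine as a stand-in score, and it rescans all pairs after each merge (the text describes rescoring only the new motif against the remaining ones). The part it is meant to show is the bookkeeping: tracking how many original motifs each merged PPM contains so that every original carries equal weight. `amalgamate` and `merge` are hypothetical names.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity(ppm_x, ppm_y):
    """Stand-in score: mean column-wise cosine for equal-length, pre-aligned motifs."""
    return sum(cosine(a, b) for a, b in zip(ppm_x, ppm_y)) / len(ppm_x)

def merge(ppm_x, n_x, ppm_y, n_y):
    """Average two PPMs, weighting by member counts so each original motif counts equally."""
    total = n_x + n_y
    return [[(n_x * a + n_y * b) / total for a, b in zip(cx, cy)]
            for cx, cy in zip(ppm_x, ppm_y)]

def amalgamate(motifs, cutoff):
    """motifs: dict name -> PPM. Returns dict merged name -> (PPM, member count)."""
    pool = {name: (ppm, 1) for name, ppm in motifs.items()}
    while True:
        pairs = [(similarity(pool[a][0], pool[b][0]), a, b)
                 for a, b in combinations(sorted(pool), 2)]
        if not pairs:
            break
        score, a, b = max(pairs)          # most similar remaining pair
        if score < cutoff:                # nothing left above the threshold
            break
        (ppm_a, n_a), (ppm_b, n_b) = pool.pop(a), pool.pop(b)
        pool[a + "+" + b] = (merge(ppm_a, n_a, ppm_b, n_b), n_a + n_b)
    return pool

# Hypothetical one-position motifs
motifs = {
    "m1": [[0.7, 0.1, 0.1, 0.1]],
    "m2": [[0.6, 0.2, 0.1, 0.1]],
    "m3": [[0.1, 0.1, 0.1, 0.7]],
}
merged = amalgamate(motifs, cutoff=0.9)
print(sorted(merged))
```

Weighting by member count matters once a merged motif is merged again: averaging a three-member PPM with a single motif using equal weights would otherwise give the newcomer as much influence as the other three combined.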

2.3 AmalgaMo parameter selection

To offer flexibility, AmalgaMo has several parameters that can be adjusted by the user. Selecting values for these parameters is relatively simple, as they correspond to intuitive aspects of motif comparison. These include:

t, the minimum total information ratio (default $[eqn]$ ); m, the maximum length difference (default $[eqn]$ ); r, the maximum core length difference (default $[eqn]$ ); s, the similarity score cutoff (default $[eqn]$ ); and a, the minimum overlap during alignment (default $[eqn]$ ).

For example, a user may want to ensure that certain subtleties are preserved in the merged motifs by using strict settings, such as $[eqn]$ and $[eqn]$ . This parameter set will ensure that for motif sets whose consensus sequences (i.e. qualitative choices of nucleotide at each position) are almost identical, those with differing positional information contents will be separated (Fig. 1). These settings are detailed in the Supplementary Notes, available as supplementary data at Bioinformatics Advances online.

2.4 Optimization and evaluation

Although there are many possible use cases for AmalgaMo, we primarily focused on finding the optimal parameter settings for effective regression-based motif enrichment analysis—specifically, the case where the target of the regression is the change in chromatin accessibility between two biological conditions. In order to achieve this goal, we first defined our measures of success. As biological conditions are engendered by differential TF activation, we reasoned that many TFs selected via motif enrichment analysis should be differentially expressed (DE). However, since not all chromatin-bound TFs may be DE, we also considered another measure: a large number of selected TFs should have detectable mRNA in at least one of the two conditions being compared. Still, it is possible for some relevant TFs to be present in protein form, without any corresponding mRNA. However, since the proteome is context-dependent, we reasoned that if enrichment analysis yields consistent results across different contexts, these metrics reliably capture predictive value.

To calculate these metrics, we needed high-quality paired bulk RNA-seq and ATAC-seq data. We obtained such data from three independent sources (Li et al. 2019, Pahl et al. 2024, Watt et al. 2025) offering biologically homogeneous samples with two comparable conditions (and $[eqn]$ replicates per condition). Data processing is described in detail in Supplementary Note 1, available as supplementary data at Bioinformatics Advances online. Then, given a merged motif set, we could evaluate its performance in the context of regression-based motif enrichment analysis by counting the number of selected TFs (i.e. TFs inferred to explain change in chromatin accessibility) that were also differentially expressed at the transcriptomic level (Fig. 1, available as supplementary data at Bioinformatics Advances online).

To find the optimal parameters, we performed a grid search over three parameters of AmalgaMo (t, m, and r; keeping $[eqn]$ and $[eqn]$ ), merging motifs from the HOCOMOCO v12 human core motif collection (Vorontsov et al. 2024). We used each of these 27 merged motif sets as input to monaLisa Lasso stability selection (Machlab et al. 2022), supplying the $[eqn]$ fold change (FC) in accessibility between two selected conditions as the target of Lasso regression (details in Supplementary Note 2, available as supplementary data at Bioinformatics Advances online). Then, we used RNA-seq data to obtain differentially expressed (DE) TFs for the same two conditions ( $[eqn]$ FC $[eqn]$ and FDR $[eqn]$ ), and counted how many mapped to motifs selected by monaLisa. Considering “positives” to be DE TFs and “negatives” to be non-DE TFs, we calculated the Matthews correlation coefficient (MCC) for each parameter set. Finally, for each dataset, we ranked parameter sets by the number of DE TFs selected, breaking ties using MCC. As the optimal setting, we selected the parameters that consistently ranked first across all datasets.
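The evaluation logic of this paragraph can be sketched as below. The DE thresholds are taken from the caption of Fig. 2 (|log2 FC| > 1 and FDR < 0.05). The grid values for `t`, `m`, and `r` are hypothetical (the text only implies three values per parameter, giving 27 sets), as are the example counts and the function names.

```python
import math
from itertools import product

def is_de(log2_fc, fdr, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Differential expression call using the thresholds given in the Fig. 2 caption."""
    return abs(log2_fc) > fc_cutoff and fdr < fdr_cutoff

def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient. Here positives are DE TFs, negatives are
    non-DE TFs, and a 'predicted positive' is a TF selected by the enrichment analysis."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def rank_parameter_sets(results):
    """results: list of (label, n_de_selected, mcc).
    Most DE TFs selected first; ties broken by higher MCC."""
    return sorted(results, key=lambda row: (row[1], row[2]), reverse=True)

# Hypothetical grid: three candidate values per parameter -> 3 * 3 * 3 = 27 sets
t_values, m_values, r_values = (0.7, 0.8, 0.9), (0, 1, 2), (0, 1, 2)
grid = list(product(t_values, m_values, r_values))

# Hypothetical results for three parameter sets on one dataset
results = [("set_a", 10, 0.20), ("set_b", 12, 0.10), ("set_c", 12, 0.30)]
ranking = rank_parameter_sets(results)
print(len(grid), [label for label, _, _ in ranking])
```

Ranking primarily by the number of DE TFs recovered rewards sensitivity, while the MCC tie-break penalizes parameter sets that reach the same count only by selecting many more non-DE TFs.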

To benchmark AmalgaMo against previous methods, we repeated the above evaluation using the HOCOMOCO v12 clustered human motif collection, generated by Vorontsov et al. (2024) using MacroAPE (Vorontsov et al. 2013) followed by hierarchical clustering. As another benchmark, we merged the HOCOMOCO v12 human core motif collection using universalmotif merge_similar (Tremblay 2024) with default parameters. We also included an alternative to AmalgaMo merged motifs in which a representative original motif (the medoid) was selected for each merged set, rather than averaging their PPMs. Finally, we evaluated the original HOCOMOCO v12 human core motif collection alongside the merged/clustered motifs as reference.
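For the representative-motif variant, a medoid is conventionally the member with the greatest total similarity to all other members of its group. The sketch below uses that standard definition; whether AmalgaMo computes it in exactly this way is an assumption, and the motif names and similarity values are hypothetical.

```python
def medoid(names, similarity):
    """Return the member with the highest summed similarity to all other members.
    similarity: nested dict, similarity[a][b] = score between motifs a and b."""
    return max(names, key=lambda a: sum(similarity[a][b] for b in names if b != a))

# Hypothetical pairwise similarity scores within one merged set
scores = {
    "motif_A": {"motif_B": 0.9, "motif_C": 0.8},
    "motif_B": {"motif_A": 0.9, "motif_C": 0.6},
    "motif_C": {"motif_A": 0.8, "motif_B": 0.6},
}
representative = medoid(["motif_A", "motif_B", "motif_C"], scores)
print(representative)
```

Unlike an averaged PPM, the medoid is always one of the original database motifs, which keeps a direct link to a single TF but discards the information carried by the other members.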

We also assessed the effects of merging on motif log-odds scores and sensitivity via FIMO (Grant et al. 2011) using all available ChIP-seq datasets in the ChIP-atlas (Zou et al. 2022) (Supplementary Note 3, available as supplementary data at Bioinformatics Advances online) and documented additional considerations for AmalgaMo parameter selection such as motif source, quality, and database (Supplementary Note 4, available as supplementary data at Bioinformatics Advances online). Finally, we biologically validated merged motifs via enrichment analysis with AME (McLeay and Bailey 2010) (Supplementary Note 5, available as supplementary data at Bioinformatics Advances online).

3 Results

3.1 Improved predictive value of regression-based motif enrichment analysis

We ran Lasso stability selection from monaLisa (Machlab et al. 2022) using the original and clustered HOCOMOCO v12 motif sets (Vorontsov et al. 2024), and 27 merged motif sets obtained using AmalgaMo with different parameter settings. Using $[eqn]$ FC in accessibility as the target of Lasso regression, the merged motif set generated by AmalgaMo with parameters $[eqn]$ consistently performed best across all three datasets. This AmalgaMo merged motif set allowed monaLisa to select the greatest number of DE TFs for two of the three datasets, compared to other motif sets (Table 1). For these two datasets, the HOCOMOCO clustered set was a close second in terms of DE TF selection, but it yielded substantially more TFs that lacked detectable mRNA. For the Watt et al. (2025) dataset, the AmalgaMo representative motif set recovered the most DE TFs. However, the AmalgaMo merged and HOCOMOCO clustered motifs recovered close to the same number of DE TFs while yielding fewer TFs that lacked detectable mRNA.

We provide a more detailed look into motifs selected via AmalgaMo for the dataset of stimulated versus unstimulated CD4 T cells from Pahl et al. (2024) in Fig. 2, available as supplementary data at Bioinformatics Advances online. Here, we can see that, despite keeping the similarity score threshold and minimum overlap constant ( $[eqn]$ and $[eqn]$ ) in our grid search, the t, m, and r parameters had great influence over the motifs selected in the subsequent enrichment analysis. This finding highlights the importance of including and tuning such parameters in a motif merging algorithm.

Using the optimal AmalgaMo merged motif set, we found substantial overlap between the selected motifs with the HOCOMOCO original and clustered motif sets (Fig. 2, available as supplementary data at Bioinformatics Advances online). However, each one resulted in the selection of unique motifs as a result of differences in grouping and competition enforced by Lasso regression. We also compared the absolute change in gene expression of selected and non-selected TFs. On average, selected TFs had a larger change in gene expression when they were derived from AmalgaMo merged motifs compared to the HOCOMOCO clustered set, but not compared to the original motif set. Among AmalgaMo merged hits consisting of more than one motif, 18 DE TFs were identified, 6 of which were also identified via the HOCOMOCO original motif set, and 13 via the HOCOMOCO clustered set.

Comparing the union of TFs selected via these three motif sets for the Pahl et al. (2024) dataset, there is a clear advantage to grouping motifs for regression-based motif enrichment analysis, especially using AmalgaMo (Fig. 2). For example, NR4A family TFs are well-known to have major regulatory roles upon stimulation of CD4 T cells (Odagiu et al. 2020). It is also known that they are regulated via their expression levels, making them good candidates for the evaluation of regression-based differential motif enrichment analysis results via RNA-seq data. Here, we found that only the AmalgaMo merged motif set allowed for the selection of NR4A2, while NR4A1 was selected with the HOCOMOCO original and AmalgaMo merged motif sets but missed by the HOCOMOCO clustered motif set. Altogether, these results demonstrate that merging motifs using our optimized settings boosts the predictive value of motifs selected via regression-based enrichment analysis.

3.2 Validation and the effect of merging on other aspects of motif enrichment analysis

For a well-rounded evaluation of AmalgaMo, we assessed changes in motif log-odds scores and sensitivities after merging, using all available human TF ChIP-seq data (17 607 experiments covering 682 TFs) from the ChIP-atlas (Zou et al. 2022). We found that many factors influenced these changes, including motif source data type (ChIP-seq and/or HT-SELEX) and motif quality. Though these findings are not surprising, we provide details and make suggestions for customization of parameter settings in Supplementary Notes 3 and 4, available as supplementary data at Bioinformatics Advances online.

For biological validation of merged motifs, we measured the enrichment of these averaged PPMs in ChIP-seq data using the independent statistic-based enrichment method AME (McLeay and Bailey 2010) and compared them to the enrichment of their original component motifs (Supplementary Note 5.1, available as supplementary data at Bioinformatics Advances online). We found that combining motif PPMs in this way did not impact their enrichment. Finally, although regression-based motif enrichment analysis likely benefits most from merging with AmalgaMo, we also assessed the results of AME (McLeay and Bailey 2010) on ATAC-seq data from Pahl et al. (2024) using merged motifs (Supplementary Note 5.2, available as supplementary data at Bioinformatics Advances online). We found that, with more relaxed AmalgaMo parameters, the AME ranking of enriched motifs was less consistent with that of the original HOCOMOCO motifs, emphasizing again the importance of tuning these additional constraints.

4 Discussion

Although we have demonstrated the benefits of applying AmalgaMo for regression-based motif enrichment analysis, this method is still not perfect. Even with optimized motif merging parameters, only a fraction of differentially expressed TFs were inferred to drive differential accessibility via regression-based motif enrichment analysis. AmalgaMo and other existing motif merging/clustering methods are not informed by the biological roles of TFs. As a result, they are subject to the following major pitfall: when two TFs with very similar motifs differ in their context-specific directional effects on gene expression, merging their motifs may cancel out their individual effects inferred by regression-based enrichment analysis. For example, evidence in the literature suggests that HIC1 tends to repress the expression of key genes during T cell activation while NFI factors tend to function as transcriptional activators (Burrows et al. 2017, Adam et al. 2020). Since their motifs are very similar, they were merged by AmalgaMo, and their individual effects were likely obscured in the process (Fig. 2). However, HIC and NFI motifs are separate in the HOCOMOCO clustered set, allowing for their selection by monaLisa. Unfortunately, this phenomenon is unavoidable, regardless of the chosen merged/clustered motif set. Thus, we encourage anyone applying regression-based motif enrichment analysis to consider trying multiple motif sets in order to reveal such cases, which may also be context-specific. For other applications, this problem is less relevant.

Another important limitation of AmalgaMo stems from the fact that PPMs of different TFs are averaged to create synthetic motifs. As TFs often have context-dependent binding preferences and motif models assume that the nucleotide binding preferences at each position are independent of the nucleotides at all other positions, the averaging of PPMs may be problematic (despite our finding that these merged motifs are enriched to a similar degree in ChIP-seq data as their original components). Unfortunately, this problem cannot be solved by choosing representative motifs instead. One possible workaround for future studies may involve first partitioning a motif collection by degree of positional independence, and then running AmalgaMo on the partition for which this assumption yields a good approximation of TF binding. For example, Zhao et al. (2012) showed that a binding energy model that takes into account non-independent interactions is much more effective for TFs in the bZIP and bHLH families in particular. Thus, excluding these families from the merging process may reduce the exacerbation of limitations accompanying motif model assumptions.

5 Conclusion

AmalgaMo is a flexible motif merging tool that can be tuned for specific applications. We provide thorough documentation that will allow users to tailor any human database to their needs, or easily produce a non-redundant database for any non-human TF set. We also demonstrate that, with the optimal parameter settings, AmalgaMo can balance the tradeoff between the positive predictive value of motif hits and the number of motifs selected by downstream regression-based motif enrichment analysis. This balance is key to providing researchers with a reliable set of candidate upstream regulators and mitigating misattribution of TF activity.

Supplementary Material

vbag043_Supplementary_Data

Bibliography

1. Adam RC, Yang H, Ge Y et al. NFI transcription factors provide chromatin access to maintain stem cell identity while preventing unintended lineage fate choices. Nat Cell Biol 2020;22:640–50. doi:10.1038/s41556-020-0513-0. PMID: 32393888. PMCID: PMC7367149.
2. Ali O, Farooq A, Yang M et al. abc4pwm: affinity based clustering for position weight matrices in applications of DNA sequence analysis. BMC Bioinformatics 2022;23:83. doi:10.1186/s12859-022-04615-z. PMID: 35240993. PMCID: PMC8896320.
3. Broin PO, Smith TJ, Golden AA. Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach. BMC Bioinformatics 2015;16:22. doi:10.1186/s12859-015-0450-2. PMID: 25627106. PMCID: PMC4384390.
4. Bruse N, Heeringen SJV. GimmeMotifs: an analysis framework for transcription factor motif analysis. bioRxiv 2018. doi:10.1101/474403. Preprint: not peer reviewed.
5. Burrows K, Antignano F, Bramhall M et al. The transcriptional repressor HIC1 regulates intestinal immune homeostasis. Mucosal Immunol 2017;10:1518–28. doi:10.1038/mi.2017.17. PMID: 28327618.
6. Castro-Mondragon JA, Jaeger S, Thieffry D et al. RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections. Nucleic Acids Res 2017;45:e119. doi:10.1093/nar/gkx314. PMID: 28591841. PMCID: PMC5737723.
7. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics 2011;27:1017–8. doi:10.1093/bioinformatics/btr064. PMID: 21330290. PMCID: PMC3065696.
8. Li L, Wang Y, Torkelson JL et al. TFAP2C- and p63-dependent networks sequentially rearrange chromatin landscapes to drive human epidermal lineage commitment. Cell Stem Cell 2019;24:271–84.e8. doi:10.1016/j.stem.2018.12.012. PMID: 30686763. PMCID: PMC7135956.