epLSAP-Align: a non-sequential protein structural alignment solver with entropy-regularized partial linear sum assignment problem formulation

Xuechen Zhang; Zhuoyang Chen; Junyu Li; Qiong Luo; Longjun Wu; Weichuan Yu

PMC · DOI:10.1093/bioinformatics/btaf309·May 20, 2025

epLSAP-Align: a non-sequential protein structural alignment solver with entropy-regularized partial linear sum assignment problem formulation

Xuechen Zhang, Zhuoyang Chen, Junyu Li, Qiong Luo, Longjun Wu, Weichuan Yu

PDF

Open Access

TL;DR

This paper introduces epLSAP-Align, a new method for aligning protein structures that improves accuracy and efficiency for non-sequential alignments.

Contribution

The novel contribution is formulating non-sequential protein alignment as an entropy-regularized partial linear sum assignment problem and solving it efficiently.

Findings

01

epLSAP-Align outperforms existing non-sequential alignment tools in terms of biological structure overlaps.

02

epLSAP-TM is at least 22% faster than USalign2 under the same conditions.

03

The method integrates well with existing frameworks like TM-align and MICAN.

Abstract

The three-dimensional protein tertiary structure alignment is a fundamental problem that seeks insights into functions and evolution. Previous structure alignment algorithms have adopted the sequential assumption and used dynamic programming solvers. However, many distantly related structures exhibit non-sequential similarities, and non-sequential alignment tools are less efficient and accurate than sequential ones. In this paper, we formulate the non-sequential alignment as the Entropy-regularized Partial Linear Sum Assignment Problem (epLSAP) and propose a solver based on Sinkhorn algorithms, referred to as epLSAP-Align. Compared with existing non-sequential alignment solvers, our epLSAP-Align can explicitly model the gap penalty, efficiently achieve global optimality and balance coverage and fidelity. We show that epLSAP-Align can be easily integrated into the existing frameworks,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

USalign2

Diseases1

NS

Mutations1

C7015-23G

Figures1

Click any figure to enlarge with its caption.

Graphical illustration of the epLSAP-Align workflow using an example pair abedn- ACDNB. Our goal is to align identical letters regardless of their cases. The left panel shows that epLSAP-Align can be integrated into a topology-dependent superposition method (e.g. TM-align) or a topology-independent superposition method (e.g. MICAN) before non-sequential alignment. After superposition, we calculate the SP-score (Yang et al. 2012) for similarity measurement. The right panel shows the non-sequential alignment procedure of epLSAP-Align with the given similarity matrix, and the calculation is detailed in Supplementary Section S1.

Tables3

Table 1.. Comparisons w.r.t. Reference-independent metrics Nali, RMSD (Å) and SO (%) on MALIDUP(-ns), MALISAM(-ns), HOMSTRAD, 64-difficult-case and RIPC dataset, respectively.

Dataset	Metric	Method
		epLSAP-TM	epLSAP-MICAN	USalign2 (fNS)	USalign2 (sNS)	FTAlign	MICAN	SPalignNS	CLICK	TM-align	SSM
MALIDUP	Nali	89	88	93	92	87	85	63	61	87	80
	RMSD	2.42	2.40	2.75	2.71	2.68	2.58	1.61	1.81	2.69	2.30
	SO(%)	75.8 ✔	76.2 ✔	74.5	74.9	73.0	67.2	62.3	62.3	70.4	66.2
MALISAM	Nali	65	65	69	69	64	62	35	35	63	57
	RMSD	2.72	2.66	3.19	3.18	3.12	2.99	1.82	1.95	3.15	2.75
	SO(%)	70.8 ✔	71.0 ✔	69.1	69.3	67.0	50.5	46.4	44.0	63.0	58.6
MALIDUP-ns	Nali	86	86	92	93	86	84	58	60	62	54
	RMSD	2.63	2.53	2.87	2.83	2.73	2.55	1.69	1.81	3.01	2.31
	SO(%)	70.7	72.3 ✔	70.1	70.4	71.4	66.6	57.7	58.6	47.5	47.1
MALISAM-ns	Nali	61	60	67	64	62	60	31	34	48	42
	RMSD	2.55	2.64	3.28	3.24	3.13	2.96	1.82	1.94	3.44	2.72
	SO(%)	66.8 ✔	65.4 ✔	65.1	64.8	65.0	63.2	40.9	42.9	45.7	44.5
HOMSTRAD	Nali	168	167	172	171	168	167	155	153	169	159
	RMSD	1.96	1.97	2.15	2.14	2.07	2.01	1.41	1.50	2.05	1.69
	SO(%)	88.0 ✔	87.9 ✔	86.7	86.6	86.5	83.8	88.1	86.3	85.4	84.5
64-difficult-case	Nali	79	78	83	81	83	80	57	51	82	70
	RMSD	2.67	2.70	3.03	3.09	2.98	2.90	1.78	1.94	2.97	2.59
	SO(%)	69.7	69.5 ✔	69.3	69.1	69.1	59.8	56.9	48.5	66.5	62.2
RIPC	Nali	157	155	179	168	155	146	130	123	135	112
	RMSD	2.58	2.58	3.30	3.57	3.28	3.06	1.91	1.97	3.47	2.63
	SO(%)	65.5	63.8 ✔	65.2	58.0	55.4	56.5	65.1	63.1	48.1	46.1

Table 2.. Reference-dependent comparisons w.r.t. EQR/Agr(%) on the annotated non-sequential datasets.

Method	epLSAP-TM	epLSAP-MICAN	USalign2 (fNS)	USalign2 (sNS)	FTAlign	MICAN	SPalignNS	CLICK	TM-align	SSM
RIPC	192/68.1	194/68.8	191/67.7	229/81.2	185/65.6	184/65.2	192/68.1	194/68.8	132/46.8	149/52.8
MALIDUP-ns	12223/64.9	12542/66.6	12194/64.7	12319/65.4	9532/50.6	9826/52.2	8950/47.5	9062/48.1	6231/33.1	7807/41.4
MALISAM-ns	2413/32.7	2797/38.0	2348/31.9	2565/34.8	2189/29.7	2516/34.1	1986/26.7	2032/27.6	1442/19.6	1676/22.7

Table 3.. Average CPU time (sec) comparisons between epLSAP-TM and USalign2 (sNS and fNS) on seven datasets.

Method	MALIDUP	MALISAM	MALIDUP-ns	MALISAM-ns	HOMSTRAD	64-diffcult	RIPC	Total average
epLSAP-TM	0.058	0.057	0.058	0.056	0.334	0.076	0.254	0.313
USalign (sNS)	0.066	0.065	0.067	0.066	0.432	0.087	0.378	0.404
USalign (fNS)	0.105	0.102	0.107	0.101	0.658	0.169	0.731	0.615
FTAlign	2.838	2.515	2.822	2.546	8.174	3.231	4.821	7.747
MICAN	0.157	0.102	0.156	0.103	1.175	0.156	0.316	1.093
SPalignNS	0.618	0.426	0.622	0.441	2.954	0.732	1.523	2.766
CLICK	5.527	2.214	5.651	2.229	8.654	8.476	5.753	8.348
TM-align	0.016	0.017	0.016	0.016	0.079	0.022	0.065	0.074
SSM	0.005	0.005	0.005	0.005	0.009	0.006	0.007	0.008

Equations10

Funding1

—Center for Ocean Research in Hong Kong and Macau

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsProtein Structure and Dynamics · Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics

Full text

1 Introduction

Comparisons and alignments of three-dimensional protein structures at the atomic level resolution are important in computational biology and bioinformatics because protein structures are usually more conserved than sequences (Illergård et al. 2009), and enable diverse applications such as protein classification (Murzin et al. 1995), evolutionary relationship analysis (Barrio-Hernandez et al. 2023), protein functional prediction (Gligorijević et al. 2021), molecular analysis (Kuntz et al. 1994) and drug discovery (Borkakoti and Thornton 2023). To a large extent, structural similarity implies functional similarities, and unannotated functions of a protein can be predicted by comparing them with structurally similar proteins with known functions (Hasegawa and Holm 2009, Ma and Wang 2014).

Protein structure alignment has been studied for over forty years. Existing tools can be categorized into sequential (SQ) alignment and non-sequential (NS) alignment (Yuan and Bystroff 2005, Ma and Wang 2014). In sequential alignment, for any two aligned residue pairs i–j and $[eqn]$ – $[eqn]$ from two protein structures, if $[eqn]$ then $[eqn]$ (Strickland et al. 2005). Such an ordering relationship enables efficient search in dynamic programming solvers. Existing sequential alignment algorithms take two different approaches to exploring features in protein similarity measurement: flexible local feature alignment and rigid-body superposition optimization. Flexible local feature alignments use local features regardless of the superposition of proteins to be aligned. They are good at comparing proteins in different conformational states and locally identifying conserved regions. Representative flexible methods include 3D-BLAST (Yang and Tung 2006), CLePAPS (Wang and Zheng 2008), Kpax (Ritchie et al. 2012) and Foldseek (Van Kempen et al. 2024). In comparison, rigid-body superposition-based methods treat structures as rigid bodies and explore similarities by comparing their shapes. They are well suited for closely evolutionarily related protein comparisons. Representative rigid-body superposition-based methods include TM-align family (Zhang and Skolnick 2005, Zhang et al. 2022, Margelevičius 2024), SSM (Krissinel and Henrick 2004), CE (Shindyalov and Bourne 1998), DALI (sequential version) (Holm 2022), DeepAlign (Wang et al. 2013) and SPalign (Yang et al. 2012). Furthermore, many tools (Ritchie et al. 2012, Wang et al. 2013) adopt both flexible alignment and rigid body-based alignment to produce more comprehensive alignment results than those with the single methodology.

Over 17% of structurally similar proteins whose fragment rearrangement can only be identified by non-sequential alignments (Abyzov and Ilyin 2007). Common non-sequential alignments include circular permutation and local swap, and they are preferred in cases such as binding interface comparisons for evolutionary and functional inference, where the overall similarity of protein pairs may be more informative than sequential similarity. Representative non-sequential alignment tools include GANGSTA+ (Guerler and Knapp 2008), FlexSnap (Salem et al. 2010), CLICK (Nguyen and Madhusudhan 2011), MICAN (Minami et al. 2013, 2018), SPalignNS (Brown et al. 2016), FTAlign (Wen et al. 2020) and USalign2 (Zhang and Pyle 2022).

Among non-sequential methods, CLICK and GANGSTA+ focus on helices and strands features and use combinatorial search algorithms. FlexSnap adopts a greedy search strategy on fragment pairs. MICAN and FTAlign are top-performers in topology-independent superposition. SPalignNS and USalign2 break the ordering requirement of dynamic programming and use the asymmetric form of linear sum assignment problem (LSAP) (Burkard and Cela 1999), the general case of sequential alignment, for non-sequential search. To reduce the computational cost of the exact solution, SPalignNS uses the deep greedy switching (DGS) (Naiem and El-Beltagy 2013) and USalign2 uses the enhanced greedy search (EGS) algorithm (Hu et al. 2018).

These non-sequential methods, however, suffer from the following three drawbacks associated with the greedy search. First, the above-mentioned greed search methods are heuristic algorithms and global optimality is not guaranteed. Second, as studied in sequential alignment, the gap penalty for deletion/insertion is essential, mainly because it facilitates the matching for various biologically meaningful results. However, neither of the above greedy methods explicitly models of gap penalty, and deletion/insertion is checked during heuristic assignments, which may lead to a suboptimal solution. Third, existing non-sequential methods mainly focus on the fidelity (minimal average spatial deviations among aligned pairs) and disregard the overall coverage (maximal aligned residue numbers). Generally, the best fidelity of the minimum average spatial deviations and the maximum overall coverage cannot be achieved at the same time, and one increases at the cost of the other. An important topic in sequential alignment, the coverage-fidelity trade-off control, which emphasizes producing an alignment with the smallest deviation with as many aligned pairs as possible (Zemla 2003, Krissinel and Henrick 2004, Collier et al. 2017), can be further explored in non-sequential alignments. Therefore, we aim to consider the solutions to the above issues in non-sequential alignment.

To address these challenges, we propose epLSAP-Align for non-sequential alignment with three advantages of the sequential dynamic programming search algorithm, including explicit alignment gap modeling, efficient global optimality and balancing coverage-fidelity trade-off control. Specifically, epLSAP-Align uses entropy regularized optimal transport solvers (Sinkhorn 1967, Cuturi 2013, Brun et al. 2022) for both efficiently solving partial LSAP (pLSAP) and balancing coverage-fidelity trade-offs, and is integrated into two representative superposition alignment algorithms: topology-independent method MICAN Minami et al. (2013) and topology-dependent method TM-align (Zhang and Skolnick 2005). Comprehensive experiments on seven public datasets for non-sequential alignment demonstrate the effectiveness and efficiency of our proposed epLSAP-Align.

2 Materials and methods

2.1 epLSAP for non-sequential protein structural alignment

2.1.1 Gap modeling in non-sequential alignment

We illustrate the proposed epLSAP-Align with a simple example in Fig. 1. Here we focus on the right panel about non-sequential solver formulation. Given two sequences $[eqn]$ and $[eqn]$ , to model the gap penalty during sequence matching, we add the element ‘-’ to the end of each sequence representing a gap term, and the revised sequences are $[eqn]$ and $[eqn]$ with $[eqn]$ “–.” The non-sequential alignment with gap penalty corresponds to some invertible mapping $[eqn]$ with the following property

[eqn]

where the first line indicates that each element of $[eqn]$ and $[eqn]$ corresponds to an element in another sequence or “–,” and the second line indicates that a “–” has at least one image or preimage. In the LSAP optimizing diagram, the correspondence map to be solved between $[eqn]$ and $[eqn]$ is represented by the bi-permutation matrix $[eqn]$ , where the one-to-one match between the i-th element of $[eqn]$ and the j-th element of $[eqn]$ is represented by $[eqn]$ and mismatch as $[eqn]$ . In the pLSAP setting, we can modify M in LSAP into the gap-bi-permutation matrix $[eqn]$ to represent the correspondence map $[eqn]$ of Equation (1) in the matrix form as follows

[eqn]

where $[eqn]$ denotes the correspondence of the i-th element of $[eqn]$ and the j-th element of $[eqn]$ , and $[eqn]$ stands for the gap-to-gap correspondence. In structural alignment settings, the similarity matrix $[eqn]$ consists of the similarity scores between residues and gap penalties. For $[eqn]$ , $[eqn]$ , $[eqn]$ is the similarity score between $[eqn]$ and $[eqn]$ ; $[eqn]$ and $[eqn]$ are the gap penalties; $[eqn]$ is set to zero because it makes no difference to the result. The pLSAP for non-sequential protein alignment reads

[eqn]

where maximizing term indicates the objective of finding the optimal correspondence w.r.t. the largest total similarity, and equality constraints come from Equation (2). We then transform Equation (3) into its dual form for minimization. With positive coefficients $[eqn]$ ( $[eqn]$ ), $[eqn]$ ( $[eqn]$ ), and k, the term to be minimized reads

[eqn]

where

[eqn]

Graphical illustration of the epLSAP-Align workflow using an example pair abedn- ACDNB. Our goal is to align identical letters regardless of their cases. The left panel shows that epLSAP-Align can be integrated into a topology-dependent superposition method (e.g. TM-align) or a topology-independent superposition method (e.g. MICAN) before non-sequential alignment. After superposition, we calculate the SP-score (Yang et al. 2012) for similarity measurement. The right panel shows the non-sequential alignment procedure of epLSAP-Align with the given similarity matrix, and the calculation is detailed in Supplementary Section S1.

The new matrix $[eqn]$ is the cost matrix. To ensure $[eqn]$ , we can simply take all $[eqn]$ and $[eqn]$ as $[eqn]$ , and $[eqn]$ .

2.1.2 epLSAP and sinkhorn solver

To reduce the computation of the combinatorial optimization problem Equation (4), we first use the stochastic matrix to approximate the permutation matrix. In our setting, the gap-bi-stochastic matrix $[eqn]$ is the continuous version of $[eqn]$ . With the stochastic matrix surrogate, the dual form Equation (4) is a partial optimal transport (pOT) problem, and it has been well studied that the Sinkhorn algorithm can efficiently offer a decent approximate solution for entropy-regularized pOT problems. Formally, entropy-regularized pOT reads

[eqn]

where $[eqn]$ is the entropy of $[eqn]$ , and $[eqn]$ is the parameter to control the influence of the entropy term on the pOT solution. By using the Lagrange multiplier $[eqn]$ and $[eqn]$ , the optimal solution $[eqn]$ satisfies

[eqn]

After getting the optimal gap-bi-stochastic matrix $[eqn]$ , we can infer the alignment by taking indices with maximum $[eqn]$ values from the row or column directions and eliminating the non-unique correspondence. The alignments from both row and column directions are meaningful and largely similar in most cases. We can choose results in either row or column direction by costumed criteria. In this study, we choose the alignment with the higher TM-score (Zhang and Skolnick 2004) as default. The Sinkhorn solver for $[eqn]$ and alignments is in Supplementary Algorithm S1, also with a concrete computational example attached. Previous analysis (Cuturi 2013) shows that the time complexity of Sinkhorn algorithm is $[eqn]$ , where n is the maximum length of the two proteins. Compared with previous LSAP heuristic approximation algorithms for non-sequential alignment, whose theoretical time complexity is also $[eqn]$ , our epLSAP benefits from its insensitivity to initialization. Specifically, greedy search algorithms such as those used in USalign2 depend on initial seeds, and selecting different initial seeds will lead to different results. To eliminate the effect of initial seeds, more than one round of greedy search is performed. In contrast, the convergence of epLSAP regardless of initialization is ensured by the property of the Sinkhorn algorithm, whereas USalign2 uses greedy search with different initializations and chooses the best.

2.1.3 Entropy regularization and coverage-fidelity trade-off

The entropy regularization in Equation (5) not only provides an efficient solver for pLSAP, but is also a biologically meaningful controller for coverage-fidelity trade-offs in protein structural alignment. In the assessment of structural alignment, the number of aligned residues (Nali), and the spatial or geometrical similarity across aligned residues, such as root mean square deviation (RMSD), namely coverage and fidelity metrics, respectively, usually yield inconsistent results during optimization. For example, aligning all residues will lead to the best Nali but poor RMSD because all the implausible alignments are included. In contrast, only picking the nearest aligned pairs will lead to the best RMSD but the worst Nali. Because coverage-fidelity trade-offs vary in different applications, it is desirable that the function to be optimized can be tunable for coverage-fidelity trade-offs. In Equation (5), the pOT part corresponds to maximizing fidelity by aligning the pairs with the smallest costs, and the entropy part aims to control coverage by maximizing the entropy $[eqn]$ . When $[eqn]$ is small, it indicates that the majority mass of $[eqn]$ is concentrated on a small portion of entries in $[eqn]$ , and the alignment inferred from such a low entropy $[eqn]$ will lead to a small Nali. In contrast, a high-entropy $[eqn]$ uniformly distributes the mass to each entry and increases the probability of finding the aligned pair of each residue, leading to a large Nali. Therefore, the coverage-fidelity trade-off for alignment can be tunable by setting $[eqn]$ . When $[eqn]$ is among plausible intervals for the Sinkhorn solver, a smaller $[eqn]$ indicates more weight on coverage, and a larger $[eqn]$ indicates more focus on fidelity. Our experimental results in Section 3.5 confirm these analysis.

2.2 Implementation of epLSAP-Align

To demonstrate the flexibility of the epLSAP-Align as an accurate and efficient solver for non-sequential protein structural alignment with different superposition methods, we integrate it into two representative structural alignment frameworks: MICAN (Minami et al. 2013) and TM-align (Zhang and Skolnick 2005). MICAN is the representative method for using the greedy search strategy on the secondary structural element (SSE) for topology-independent superposition. TM-align is the representative method for using Kabsch method-based topology-dependent superposition. We implement epLSAP-Align by first using the superposition estimated from MICAN and TM-align, denoted as epLSAP-MICAN and epLSAP-TM, respectively. Then with the superposed structures, we adopt SP-score (Yang et al. 2012) to calculate the score matrix for alignment optimization as follows

[eqn]

where L is the sum of aligned core residues, scale factor $[eqn]$ , and size-independent normalization factor $[eqn]$ . The SP-score is used for the final alignment because its fixed normalization factor $[eqn]$ ensures that only sufficiently close residue pairs are considered as candidates. This leads to more robust results compared to size-dependent normalization scores [e.g. TM-score (Zhang and Skolnick 2004)], particularly when exploring non-sequential structural correspondence between large proteins. The default parameters in Supplementary Algorithm S1 are $[eqn]$ , $[eqn]$ and $[eqn]$ .

2.3 Evaluation metrics for non-sequential alignment

To assess the non-sequential alignment quality, we first use three standard reference-independent metrics (Brown et al. 2016, Wen et al. 2020): the number of aligned residues (Nali), root mean square deviation (RMSD), and structure overlap (SO). Given query protein Q and target protein T with n aligned residues $[eqn]$ and $[eqn]$ , respectively, where $[eqn]$ , the Euclidean distance between the i-th aligned residue after superposition is $[eqn]$ , the formal definitions of Nali, RMSD and SO between Q and T read

[eqn]

where $[eqn]$ is the indicator function on judging whether $[eqn]$ is less than $[eqn]$ . Here, Nali measures the coverage, RMSD measures the fidelity, and SO measures the performance on the coverage-fidelity trade-off. To further explore how the aligned residues distribute across different RMSD thresholds, we expand the classical SO into a function about structure overlap w.r.t. varied RMSD threshold as follows

[eqn]

Following previous work (Brown et al. 2016, Zhang and Pyle 2022), we also adopt two reference-dependent metrics. The first is equivalent reference residues (EQR), defined as the total number of aligned residue pairs shared by the manually curated reference alignment and the alignments from automated alignment tools. The second is the percentage of agreement, defined as EQR divided by the length of the reference alignment.

2.4 State-of-the-art tools for comparisons

To evaluate the performance of the proposed epLSAP-Align, we use six state-of-the-art structure alignment tools for comparison. Five of these six tools are the most recent non-sequential structure alignment tools, including USalign2 (Zhang et al. 2022), FTAlign (Wen et al. 2020), SPalignNS (Brown et al. 2016), MICAN (Minami et al. 2013) and CLICK (Nguyen and Madhusudhan 2011). In USalign2, there are two non-sequential aligning modes. USalign2 (fNS) treats residues as point clouds and abandons any sequential order information, and USalign2 (sNS) preserves the sequential order within aligned fragment pairs. Also, we choose two widely used tools, TM-align (Zhang and Skolnick 2005) and SSM (Krissinel and Henrick 2004), which has decent performance even in both sequential and non-sequential cases. To use these methods for comparisons, we download their latest versions and run them locally with the default parameters.

2.5 Non-sequential alignment test datasets

To evaluate the non-sequential alignment performance of our epLSAP-Align, we use seven widely used datasets, including MALIDUP, MALISAM, MALIDUP-NS, MALISAM-NS, RIPC, 64-difficult-case, and HOMSTRAD. We summarize each dataset in Supplementary Section S2.

3 Results

3.1 Reference-independent evaluations

Table 1 shows the reference-independent comparisons of the representative non-sequential alignment methods. We also include the widely used sequential alignment method TM-align as a reference to compare the performance improvement from non-sequential methods applied on datasets with great extents of NS. Except for the HOMSTRAD dataset, either epLSAP-TM or epLSAP-MICAN achieves the best SO scores, which balance Nali and RMSD. On the HOMSTRAD dataset, our epLSAP-TM and epLSAP-MICAN perform slightly worse than SPalignNS on SO, but better than the other methods. Noticeably, our epLSAP-MICAN is based on the superposition matrix of MICAN but with the proposed epLSAP-Align, and epLSAP-MICAN performs better than the original MICAN. Similarly, USalign2 uses the greedy algorithm for heuristic search whereas epLSAP-TM uses epLSAP-Align for global optimization instead. The improved SO performance of epLSAP-MICAN and epLSAP-TM indicates that the proposed epLSAP algorithm can be combined with various existing (non-)sequential structural alignment frameworks, and the explicit controlling of coverage-fidelity trade-off of epLSAP helps produce results with better structural overlap.

We use the two-sample t-test and binary plot (Brown et al. 2016) to show the significance of the SO improvement brought by the epLSAP solver. The baseline of epLSAP-TM is Usalign2 (fNS) and the baseline of epLSAP-MICAN is MICAN, respectively. Table 1 shows that except those of epLSAP-TM on MALIDUP-ns, 64-difficult-case and RIPC datasets, all comparison results are statistically significant with $[eqn]$ . Because epLSAP-MICAN performs significantly better than MICAN, we therefore focus on the detailed distribution comparison w.r.t. the SO improvement between epLSAP-TM and USalign2, as shown in Supplementary Fig. S3. For $[eqn]$ samples of the datasets used, epLSAP-TM performs better than USalign2 (fNS).

To better show how the aligned residues distribute across different RMSD thresholds, we use the varying structure overlaps w.r.t. RMSD in Equation (9). Here we vary the threshold from 3.0 to 7.0 and record the corresponding overlap at intervals of 0.5 in Supplementary Fig. S4. We can see that $[eqn]$ curves of both CLICK and SPalignNS are much more stable than others. Also taking their properties of lower Nali and RMSD, as shown in Table 1, we can conclude that CLICK and SPalignNS are specialized in good fidelity results, where the aligned residues are very close after superposition. TM-align is the sequential method and its $[eqn]$ performs not well in non-sequential datasets. Among the remaining methods, our epLSAP-TM and epLSAP-MICAN achieve higher $[eqn]$ with small thresholds, and we show this result in Supplementary Fig. S4 by specifying two vertical lines when the threshold is $[eqn]$ and $[eqn]$ , the latter $[eqn]$ is another important threshold to quantify the structure overlap (Zhang and Skolnick 2005). As RMSD threshold increases over $[eqn]$ , methods with higher Nali than our epLSAP-TM (MICAN), including USalign2 (fNS) and USalign2 (sNS), achieve higher $[eqn]$ than epLSAP-TM (MICAN). FTAlign shares a similar Nali with epLSAP-TM (MICAN), and at a high RMSD threshold level ( $[eqn]$ ), it begins to achieve a similar or higher $[eqn]$ than epLSAP-TM (MICAN). epLSAP-MICAN improves MICAN on both Nali and RMSD metrics. All these results contribute to the conclusion that the proposed epLSAP solver can help obtain the alignment results with a better coverage-fidelity trade-off.

3.2 Reference-dependent evaluations

We conduct reference-dependent evaluations on three datasets with non-sequential annotation: RIPC, MALIDUP-ns and MALISAM-ns. RIPC annotation is relatively sparse with only 282 annotated pairs. MALIDUP-ns and MALISAM-ns, two artificial non-sequential datasets, have 18837 and 7370 annotated pairs, respectively. When evaluated with the RIPC dataset, as shown in Table 2, our method epLSAP-MICAN and CLICK produce the second highest EQR, and epLSAP-TM performs slightly better than its most similar competitor USalign2 (fNS). For more detailed comparisons, we include the EQR of each sample pair in the RIPC dataset in Supplementary Table S2, in which we can see that for pair d1nkl___d1qdma1, USalign2 (fNS and sNS), SPalignNS and CLICK performs much better than others, and aligned pairs of d1nkl d1qdma1 take a large portion (72 out of 282), which makes USalign2 (sNS) perform much better than our epLSAP-based methods. For the remaining samples, our epLSAP-based methods have slightly better overall performance than the others. Experiments on MALIDUP-ns and MALISAM-ns datasets are at their much larger scales, as shown in the second and third rows of Table 2. In MALIDUP-ns and MALISAM-ns, our epLSAP-MICAN achieves the best EQR, and epLSAP-TM is better than USalign2 (fNS).

3.3 Computational efficiency

To fairly evaluate the computational efficiency of the proposed epLSAP-Align, we compare the average CPU time of epLSAP-TM with the other methods, especially focusing on USalign2 in the same programming environment because it is one of the fastest non-sequential alignment methods. The CPU environment for all experiments is Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz core, and the CPU time is summarized in Table 3. From the results, we can see that on average our epLSAP-TM is 22.5% faster than USalign2 (sNS), and 44.9% faster than USalign2 (fNS). We also include the average CPU time of TM-align and SSM, two widely used alignment methods not restricted in non-sequential cases. Although non-sequential methods achieve better performance w.r.t. the above non-sequential evaluation metrics, we observe that non-sequential methods are much slower than TM-align and SSM, and SSM is significantly faster than others.

3.4 Discussions on the gap modeling

Structural alignments are partial matching in nature. Give two sequences $[eqn]$ and $[eqn]$ with $[eqn]$ , only their subsets $[eqn]$ and $[eqn]$ ( $[eqn]$ ) are aligned, where $[eqn]$ and $[eqn]$ indicate permutation functions. Classical LSAP formulation without gap modeling aims to obtain the bi-permutation matrix $[eqn]$ indicating whether the i-th element of $[eqn]$ and j-th element of $[eqn]$ is matched. Ideally, this formulation will give an m-length alignment, where the shorter sequence will be aligned with a subset of the longer one. The deletion and insertion in classical LSAP formulation happen during the inference stage, such as Line 22–27 in Supplementary Algorithm S1 or choosing the pairs with the highest similarity for exchange at each iteration in the greedy algorithm of USalign2 (Hu et al. 2018, Zhang and Pyle 2022). In short, we can see that classical LSAP formulation does not fully consider the partial matching nature, and may lead to sub-optimal results. In contrast, the partial LSAP formulation in Equations (1) and (2) models the partial matching by introducing gap terms where all the unaligned elements are matched to the gap terms. Our formulation yields better performance when compared with non-sequential methods in entropy-regularized LSAP (classical LSAP with no gap modeling but with entropy regularizer, denoted as eLSAP-TM and eLSAP-MICAN, respectively), as shown in the Supplementary Table S1.

3.5 Coverage-fidelity trade-off by different λ

As discussed in Section 2.1.3, parameter $[eqn]$ can help control coverage-fidelity trade-offs in aligned pairs. Here we use different $[eqn]$ from values 10 to 300 to execute epLSAP-TM and epLSAP-MICAN on the RIPC dataset, and record the corresponding changes of Nali, RMSD, and SO w.r.t $[eqn]$ . As shown in Supplementary Fig. S5a and b, Nali and RMSD increase with $[eqn]$ , which indicates that a larger $[eqn]$ [more focus on the entropy term of Equation (5)] indeed leads to alignments of higher coverages (more Nali) but worse fidelity (larger RMSD). Supplementary Figure S5c shows the coverage-fidelity trade-off with an increasing $[eqn]$ . Supplementary Figure S5d shows that SO rapidly becomes stable regardless of the increase of $[eqn]$ , though different coverage-fidelity are shown in Supplementary Fig. S5c. These results demonstrate that our epLSAP-Align shows robust performance w.r.t. SO metric with varied $[eqn]$ .

3.6 Case study

To further illustrate the advantages of the proposed epLSAP-Align for non-sequential alignment, we use the protein pair d1af2a1.1–d1af2a1.2 from the manually permuted MALIDUP-ns dataset as an example in Supplementary Fig. S6. In C-H black symbols in line segments are ground truth correspondence, green symbols indicate the algorithm correspondence, and the sizes of these symbols are linearly scaled to the distance between the aligned $[eqn]$ atoms. In this pair, the tail residues of d1af2a1.1 are manually adjusted to correspond to the head of d1af2a1.2, as shown in Supplementary Fig. S6c. In D-F, although our epLSAP-TM starts with the same initial superposition as USalign2 (fNS) and USalign2 (sNS), epLSAP-TM achieves fewer mismatches than both USalign2 (fNS) and USalign2 (sNS). The number of outliers in F is much smaller than those in D-E, and the scales of mismatched green symbols in F are also smaller than those in D-E, indicating that epLSAP-TM produces fewer mismatches with large deviations. As shown in Supplementary Fig. S6g and h, based on the same superposition B, MICAN produces sequential-like alignment, whereas our epLSAP-MICAN produces non-sequential alignment as expected.

We further use the case of the pair 2ES9-1SXJ in SPalignNS (Brown et al. 2016) to illustrate the usefulness of epLSAP solver. We show Nali-RMSD-SO metrics obtained using different methods in the Supplementary Table S3. Our epLSAP-MICAN achieves the best SO performance, and SPalignNS achieves the second best SO performance, with less Nali but significantly lower RMSD. Because epLSAP-MICAN and SPalignNS have different superpositions, to illustrate that the performance gain is brought by the proposed epLSAP solver instead of superposition difference, we further use the proposed epLSAP solver based on the superposition estimated from SPalignNS, and record the results as epLSAP-SP_rigid. When sharing the same superposition as SPalignNS, we can see that our epLSAP-SP_rigid gets the non-sequential alignment with much more Nali at the cost of moderate RMSD increase, thus leading to a better coverage-fidelity trade-off w.r.t. the SO metric.

4 Conclusion

In this study, we present epLASP-Align, a new non-sequential alignment solver by formulating the alignment problem as the partial linear sum assignment problem and using the Sinkhorn algorithm to solve the entropy-relaxed surrogate. The proposed epLSAP-Align is flexible to be deployed in the existing protein structural alignment frameworks to explore non-sequential correspondences. Compared with the existing non-sequential alignment solvers, epLSAP-Align has three advantages: explicit gap penalty modeling, efficient global optimality and coverage-fidelity trade-offs. To demonstrate the effectiveness, we integrate epLSAP-Align into the existing popular alignment tools such as TM-align and MICAN to form new non-sequential alignment tools, namely epLSAP-TM and epLSAP-MICAN. Both epLSAP-TM and epLSAP-MICAN achieve generally higher biologically meaningful structure overlaps and higher computational efficiency on seven test datasets. Therefore, we expect that epLSAP-Align can be deployed in more alignment frameworks to facilitate the exploration of biologically meaningful non-sequential relationships.

Some further improvements are anticipated. First, suitable post-processing techniques can be developed to refine the resulting alignment. For example, outlier alignments shown in Supplementary Fig. S6f and h can be eliminated by considering pair distance or secondary structure consistency. Secondly, more sophisticated scoring schemes taking secondary structures or internal coordinates into account (Wang et al. 2013) may improve the reliability of the results. We also notice that the acceleration of the Sinkhorn-type algorithm for the optimal transport problem is a valuable and popular topic in machine learning and optimization, and recent acceleration algorithms such as (Solomon et al. 2015, Alaya et al. 2019) leverage the properties of the cost matrix for specific tasks. In the future, an interesting direction is to explore more efficient Sinkhorn-type algorithms with specific considerations of the properties of non-sequential protein structural alignment. A driving forces for developing new methods in structural alignment is their feasibility on the genomic-scale search. Notice that non-sequential methods are still much slower than methods well suited for screening large datasets, such as SSM. Innovative methods enabling non-sequential searching on large-scale datasets are a great direction for future work.

Supplementary Material

btaf309_Supplementary_Data

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Abyzov A , Ilyin VA. A comprehensive analysis of non-sequential alignments between all protein structures. BMC Struct Biol 2007;7:78–20.18005453 10.1186/1472-6807-7-78PMC 2213659 · doi ↗ · pubmed ↗
2Alaya MZ , Berar M, Gasso G et al Screening sinkhorn algorithm for regularized optimal transport. Adv Neural Inform Process Syst 2019;32:12191–201.
3Barrio-Hernandez I , Yeo J, Jänes J et al Clustering predicted structures at the scale of the known protein universe. Nature 2023;622:637–45.37704730 10.1038/s 41586-023-06510-w PMC 10584675 · doi ↗ · pubmed ↗
4Borkakoti N , Thornton JM. Alphafold 2 protein structure prediction: implications for drug discovery. Curr Opin Struct Biol 2023;78:102526.36621153 10.1016/j.sbi.2022.102526 PMC 7614146 · doi ↗ · pubmed ↗
5Brown P , Pullan W, Yang Y et al Fast and accurate non-sequential protein structure alignment using a new asymmetric linear sum assignment heuristic. Bioinformatics 2016;32:370–7.26454279 10.1093/bioinformatics/btv 580 · doi ↗ · pubmed ↗
6Brun L , Gaüzère B, Renton G et al A differentiable approximation for the linear sum assignment problem with edition. In 2022 26th International Conference on Pattern Recognition, 2022, 3822–3828.
7Burkard RE , Cela E. 1999. Linear assignment problems and extensions. In Handbook of Combinatorial Optimization: Supplement Volume A, 75–149. Springer.
8Collier JH , Allison L, Lesk AM et al Statistical inference of protein structural alignments using information and compression. Bioinformatics 2017;33:1005–13.28065899 10.1093/bioinformatics/btw 757 · doi ↗ · pubmed ↗