Efficient Protein Optimization via Structure-aware Hamiltonian Dynamics
Jiahao Wang, Shuangjia Zheng

TL;DR
HADES is a novel structure-aware Bayesian optimization method using Hamiltonian dynamics to efficiently design protein sequences with desired properties by integrating structural constraints and uncertainty modeling.
Contribution
The paper introduces HADES, a new optimization framework that combines Hamiltonian dynamics and structure-aware modeling for protein engineering, and reports better in-silico performance than the baselines it is compared against.
Findings
Outperforms state-of-the-art baselines in in-silico evaluations
Leverages structure-sequence mutual constraints for better design
Efficiently samples promising protein variants using Hamiltonian dynamics
Abstract
The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence-based optimization methods struggle with the high-dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure-aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two-stage encoder-decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed…
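The sampling idea in the abstract (simulate Hamiltonian dynamics in a continuous state, then discretize positions into a sequence) can be illustrated with a minimal, generic Hamiltonian Monte Carlo sketch. This is not the HADES implementation: the quadratic `potential`, the per-position logit state `q`, the argmax-based `discretize`, and all step sizes are assumptions chosen for illustration, and the toy potential stands in for the paper's structure-aware surrogate posterior.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues


def potential(q, target):
    """Toy potential energy U(q): a quadratic bowl centred on `target`.

    Hypothetical stand-in for -log(posterior) of a learned surrogate.
    """
    return 0.5 * np.sum((q - target) ** 2)


def grad_potential(q, target):
    """Gradient of the toy potential with respect to q."""
    return q - target


def leapfrog(q, p, target, step_size, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * step_size * grad_potential(q, target)  # initial half kick
    for i in range(n_steps):
        q += step_size * p  # full position update
        if i < n_steps - 1:
            p -= step_size * grad_potential(q, target)  # full kick
    p -= 0.5 * step_size * grad_potential(q, target)  # final half kick
    return q, p


def hmc_step(q, target, rng, step_size=0.1, n_steps=10):
    """One HMC transition: fresh momentum, leapfrog, Metropolis-Hastings test."""
    p0 = rng.standard_normal(q.shape)
    q_new, p_new = leapfrog(q, p0, target, step_size, n_steps)
    h_old = potential(q, target) + 0.5 * np.sum(p0 ** 2)
    h_new = potential(q_new, target) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.uniform()) < h_old - h_new:
        return q_new, True
    return q, False


def discretize(q):
    """Map a continuous (positions x 20) state to a sequence via argmax."""
    return "".join(AMINO_ACIDS[i] for i in np.argmax(q, axis=1))


def run_demo(n_iters=200, seed=0):
    """Sample toward an arbitrary 4-residue target and return the sequence."""
    rng = np.random.default_rng(seed)
    target = np.zeros((4, 20))
    for pos, aa in enumerate("ACDE"):  # arbitrary illustrative target
        target[pos, AMINO_ACIDS.index(aa)] = 10.0
    q = np.zeros((4, 20))
    for _ in range(n_iters):
        q, _ = hmc_step(q, target, rng)
    return discretize(q)
```

Calling `run_demo()` runs the chain from an all-zero state and returns the discretized sequence. In a real setting the gradient would come from a differentiable surrogate rather than a closed-form bowl, and the discretization rule may well differ from a plain argmax.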
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The method achieves enhanced performance with fewer sampling steps when tested on two proteins.
1. The benchmarking is not comprehensive. Experiments are limited to only two proteins (GB1 and PhoQ). More extensive testing is needed to demonstrate the efficacy of the proposed method.
2. The Method section is poorly organized and lacks key details. For example:
   a. There are no details on Bayesian optimization, although it's mentioned in Figure 1 and the Introduction.
   b. The details on the sequence encoder are not described. Do the input features include relative positional encoding?
*Originality*: While HMC is very well-studied in general, and a variety of sequence encoder/decoders have been used as proxies for Bayesian optimization in protein sequences, this combination is (to my knowledge) new. The intuition behind using a structure decoder is physically sound and worth exploring (notwithstanding some limitations described below). *Quality* & *Clarity*: The presentation of the paper is very clear and easy to follow. The analysis is reasonably thorough and well-motivated. The
1. One of the main issues with this paper is the lack of recent baselines and a limited range of tasks with varying difficulties. While HADES appears to be more efficient, it only _marginally_ outperforms the existing baseline methods. Table 3, for example, shows that much of the performance gain (e.g., compared to PEX) can be attributed to an improved surrogate model rather than HMC itself. Moreover, recent literature has demonstrated significant improvements over these baselines. For instance,
The paper is clearly written. The encoder-decoder architecture aims to distill structure relatedness into the resulting surrogate fitness scores albeit only through shared latent embedding. Nevertheless, bringing some (latent) structural information into sequence optimization seems like a good idea.
A lot of space is used to discuss Hamiltonian dynamics, though strictly speaking this is not followed. HMC(q,f) randomizes the momentum for each call and performs L updates of all residues, starting with the current seed q, accepting each update and its associated discretization with MH. The random momentum moves the system in a random direction, though it remains guided by the potential energy, which is defined as -log(P(f(q))). One would think that it would be advantageous to move in the continuous space (re
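For reference, the textbook HMC quantities that this review alludes to can be written out as below. This is the standard formulation, not necessarily the paper's: the unit mass matrix, the step size $\epsilon$, and the single accept/reject test at the end of a trajectory are assumptions, and the potential energy follows the reviewer's description.

```latex
\begin{align}
  U(q) &= -\log P\big(f(q)\big), \qquad
  H(q, p) = U(q) + \tfrac{1}{2}\, p^{\top} p, \qquad
  p \sim \mathcal{N}(0, I) \\
  p_{t+1/2} &= p_t - \tfrac{\epsilon}{2}\, \nabla U(q_t) \\
  q_{t+1} &= q_t + \epsilon\, p_{t+1/2} \\
  p_{t+1} &= p_{t+1/2} - \tfrac{\epsilon}{2}\, \nabla U(q_{t+1}) \\
  \alpha &= \min\!\Big(1,\ \exp\big(H(q, p) - H(q', p')\big)\Big)
\end{align}
```

Here $(q', p')$ is the state after $L$ leapfrog updates, and $\alpha$ is the Metropolis-Hastings acceptance probability.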
* Using Hamiltonian dynamics is a novel contribution.
* HADES outperforms all chosen baselines on GB1 and PhoQ.
* The chosen datasets, GB1 and PhoQ, are toy-like since they only require mutating up to 4 residues. This is a small search space compared to other protein engineering benchmarks such as AAV and GFP [1] that are commonly used in many works. Even the referenced work [2] evaluates on GFP, but this dataset is not used. I understand GB1/PhoQ are desirable since they don't require training oracles, but there should still be evaluation of realistic protein engineering tasks such as AAV and GFP on top of
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Advanced Multi-Objective Optimization Algorithms
