HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens
Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

TL;DR
HD-Prot introduces a hybrid diffusion-based protein language model that integrates continuous structural information with sequence data, enhancing multimodal protein modeling without losing fine-grained structural details.
Contribution
The paper proposes a hybrid diffusion protein language model that pairs discrete sequence tokens with continuous structure tokens (structure latents that avoid vector quantization), enabling joint sequence-structure modeling across several protein tasks.
Findings
Achieves competitive results in sequence-structure co-generation
Performs well in motif-scaffolding and structure prediction
Operates efficiently with limited computational resources
Abstract
Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a…
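The abstract's claim that discretizing structures loses fine-grained information can be illustrated with a toy sketch. This is not HD-Prot's tokenizer or training code. The latent vectors, the random codebook, and the names `nearest_code`, `mse`, `latents`, and `codebook` are all hypothetical, chosen only to show that snapping a continuous latent to its nearest codebook entry (vector quantization) introduces a reconstruction error that a continuous token does not.

```python
import random


def nearest_code(vec, codebook):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))


def mse(xs, ys):
    """Mean squared error between two equal-length lists of vectors."""
    total = sum((a - b) ** 2 for x, y in zip(xs, ys) for a, b in zip(x, y))
    count = sum(len(x) for x in xs)
    return total / count


rng = random.Random(0)
dim, num_latents, codebook_size = 8, 200, 16

# Hypothetical continuous structure latents, one vector per residue.
latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_latents)]

# Hypothetical codebook (random here; a real one would be learned).
codebook = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(codebook_size)]

# Vector quantization: each latent is replaced by its nearest codebook entry.
quantized = [nearest_code(v, codebook) for v in latents]

vq_error = mse(latents, quantized)        # detail lost by discretization
continuous_error = mse(latents, latents)  # continuous tokens keep the latent unchanged

print(f"VQ reconstruction MSE: {vq_error:.4f}")
print(f"Continuous-token MSE:  {continuous_error:.4f}")
```

In this toy setting the quantized representation always carries a nonzero error, because a finite codebook cannot reproduce arbitrary real-valued vectors. A learned codebook would shrink that error but not remove it, which is the limitation the abstract attributes to discretized structure tokens.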
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Machine Learning in Materials Science
