BSM: Small but Powerful Biological Sequence Model for Genes and Proteins
Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai

TL;DR
BSM is a compact, multimodal biological sequence model that effectively learns cross-modal relationships among DNA, RNA, and proteins, achieving high performance with fewer parameters and demonstrating in-context learning capabilities.
Contribution
Introduction of BSM, a small yet powerful mixed-modal biological sequence model trained on diverse datasets, enhancing cross-modal understanding and efficiency.
Findings
BSM with 110M parameters matches larger models in performance.
BSM demonstrates in-context learning for mixed-modal tasks.
Scaling to 270M parameters yields further improvements.
Abstract
Modeling biological sequences such as DNA, RNA, and proteins is crucial for understanding complex processes like gene regulation and protein synthesis. However, most current models either focus on a single type or treat multiple types of data separately, limiting their ability to capture cross-modal relationships. We propose that by learning the relationships between these modalities, the model can enhance its understanding of each type. To address this, we introduce BSM, a small but powerful mixed-modal biological sequence foundation model, trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web. These datasets capture the genetic flow, gene-protein relationships, and the natural co-occurrence of diverse biological data, respectively. By training on mixed-modal data, BSM significantly enhances learning efficiency and cross-modal…
Peer Reviews
Decision · Submitted to ICLR 2025
I like the general concept: there are myriad types of data that can be used for biological sequence modeling, and I'd like to believe that being smart about data/modality selection can lead to increased performance with fewer parameters/FLOPs by leveraging multiple types of information at once.
1. It is not clear what the exact effect of each phase (data mix) is.
2. The BSM model is split up into three phases:
   - Phase 1 is trained on 100B single-modal tokens (e.g., each sequence is only nucleic acids or only protein sequences).
   - Phase 2 is further trained on 146B tokens, for 246B tokens seen in total.
   - Phase 3 is then further trained on $\sim$16.9B more tokens, for 262.2B tokens seen in total.
3. It is unclear to this reviewer what performance gains are…
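The cumulative token counts quoted in this review can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reviewer's figures; the phase labels and the `cumulative_totals` helper are illustrative names, not taken from the paper. Summing the quoted per-phase budgets gives 246B after phase 2, matching the review, but 262.9B after phase 3 rather than the quoted 262.2B, so one of the phase 3 figures is presumably a typo or an approximation.

```python
# Per-phase token budgets in billions, exactly as quoted in the review.
# The phase labels are illustrative; the review itself marks the phase 3
# figure as approximate.
phase_tokens_b = [("phase 1", 100.0), ("phase 2", 146.0), ("phase 3", 16.9)]


def cumulative_totals(phases):
    """Return (label, running total in billions) for each phase in order."""
    totals = []
    running = 0.0
    for label, tokens in phases:
        running += tokens
        # Round to one decimal to avoid floating-point noise in the printout.
        totals.append((label, round(running, 1)))
    return totals


totals = cumulative_totals(phase_tokens_b)
for label, total in totals:
    print(f"{label}: {total}B tokens seen in total")
```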
A compact mixed-modal biological sequence foundation model is introduced and assessed across various downstream tasks.
The clarity and precision of the paper could be greatly enhanced by making it more concise. For example:
• The paper contains numerous repetitive phrases, sentences, and references that could be removed to improve readability.
• Data and experimental details would be better summarized in tables for clarity and ease of comparison.
• Section 4 is largely redundant.
• Many figures are of poor quality, with barely readable legends.
Drawing conclusions from the experiments presented in the paper…
- Modeling biological sequences and integrating heterogeneous types of sequences are important for building foundation models for biological sequences.
- Extensive experiments are conducted to demonstrate the superiority of BSM.
- The main motivation of this paper is to use a small language model (SLM) for biological sequences. But why do we need an SLM for biological sequence analysis? SLMs are usually practical for deploying language models on-device or in resource-limited settings. However, in biological sequence analysis, scientists may already have extensive computational resources. It would be great if the authors could justify why we need an SLM for biological sequence analysis.
- It may be convincing, if…
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Bioinformatics
