Language Models for Controllable DNA Sequence Design
Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji

TL;DR
This paper introduces ATGC-Gen, a transformer-based language model for controllable DNA sequence generation. It conditions on biological signals through cross-modal encoding to produce diverse sequences aligned with specified properties.
Contribution
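We develop ATGC-Gen, a transformer model that integrates biological signals through cross-modal encoding for controllable DNA sequence design. It is instantiated with both decoder-only and encoder-only architectures, and we report improved performance over prior methods. We also introduce a new ChIP-Seq-based dataset for modeling protein binding specificity.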
Findings
ATGC-Gen generates biologically relevant sequences with high controllability.
The model outperforms prior methods in sequence diversity and property alignment.
It effectively models protein binding specificity from ChIP-Seq data.
Abstract
We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our…
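The two training objectives named in the abstract, autoregressive generation and masked recovery, can be illustrated with a small sketch of how training examples might be built from one DNA sequence. This is not the authors' implementation: the function names, the `N` mask symbol, and the mask rate are assumptions chosen for illustration, and the conditioning signal (the biological property) is omitted for brevity.

```python
import random

MASK = "N"  # hypothetical mask symbol; the paper's actual mask token is not stated here


def autoregressive_pairs(seq):
    """Decoder-style objective: predict base i from the bases before it."""
    return [(seq[:i], seq[i]) for i in range(len(seq))]


def masked_recovery_example(seq, mask_rate, rng):
    """Encoder-style objective: hide some positions and recover the hidden bases."""
    n_mask = max(1, round(mask_rate * len(seq)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    corrupted = list(seq)
    targets = {}
    for p in positions:
        targets[p] = seq[p]   # remember the true base at this position
        corrupted[p] = MASK   # replace it with the mask symbol in the input
    return "".join(corrupted), targets


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the masking is reproducible
    print(autoregressive_pairs("ACGT"))
    print(masked_recovery_example("ACGTACGT", mask_rate=0.25, rng=rng))
```

In the autoregressive case every position yields a (context, next base) pair, so generation proceeds left to right. In the masked case the model sees the whole corrupted sequence at once and fills in the hidden positions.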
