BEND: Benchmarking DNA Language Models on biologically meaningful tasks
Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, Wouter Boomsma

TL;DR
BEND is a comprehensive benchmark designed to evaluate DNA language models on realistic, biologically meaningful tasks, revealing their strengths and limitations in capturing genomic features.
Contribution
This work introduces BEND, the first standardized benchmark for assessing DNA language models on complex genomic tasks, facilitating consistent evaluation and comparison.
Findings
Current DNA LMs approach expert performance on some tasks
Embeddings capture limited long-range genomic information
BEND provides a realistic assessment of model capabilities
Abstract
The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that…
Peer Reviews
Decision: ICLR 2024 poster
- The overarching motivation behind creating BEND, which emphasizes understanding the genome across longer ranges, is commendable.
- The efforts to collect the benchmark dataset are commendable, and the dedication shown in running benchmarks for so many language models and supervised baselines is admirable.
- However, I feel that the task formulation for long sequences lacks depth and isn't entirely persuasive.
- When benchmarking DNA Language models, it's crucial to explore the intricacies of training at least one of these models from scratch. This would provide a comprehensive insight into their potential and the limitations of pretraining.
- A side-by-side comparison with a DNA Language model trained from scratch is essential. Such an analysis would give a more rounded perspective on the strengt
I really liked the section that provided biological background -- it was clear and concise. The benchmark tasks are well described, and each one is important for a DNA language model to be able to address. Some of these tasks are more challenging than the ones used in previous studies.
Including only tasks from the human genome is problematic. It seems clear that a good language model of DNA should cover more than just one species. The most recent competing benchmark (Gresova 2023) includes eight tasks from three different species. This benchmark does not improve very much over the Gresova benchmark published this year.
Minor: It would be good to point out, in Section 2.1, that these descriptions are about eukaryotic genomes. Introduce "secondary structure" before using
Strengths:
- While there exist other recent benchmark datasets for DNA LMs, the authors' datasets are complementary to the existing ones. In particular, the tasks are more realistic than those in the "Genomic benchmarks" paper (but see some caveats below).
- The authors provide a comprehensive comparison of existing DNA LM performance on the benchmark datasets; they also provide a comparison to a trained-from-scratch simple supervised model.
Weaknesses:
- The supervised learning baselines are useful, but are rather weak. For example, in the context of histone modifications, a model such as Sei which is trained on much larger datasets has achieved much higher accuracy (although the results are not strictly comparable). I expect such a model to perform much better than the DNA LM approach.
- Regarding the enhancer prediction task: This is a case where additional forms of data such as DNA accessibility and DNA contacts are able to
Taxonomy
Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
