BEND: Benchmarking DNA Language Models on biologically meaningful tasks

Frederikke Isa Marin; Felix Teufel; Marc Horlacher; Dennis Madsen; Dennis Pultz; Ole Winther; Wouter Boomsma

arXiv:2311.12570 · q-bio.GN · April 10, 2024 · 28 cites

TL;DR

BEND is a comprehensive benchmark designed to evaluate DNA language models on realistic, biologically meaningful tasks, revealing their strengths and limitations in capturing genomic features.

Contribution

This work introduces BEND, the first standardized benchmark for assessing DNA language models on complex genomic tasks, facilitating consistent evaluation and comparison.

Findings

01 · Current DNA LMs approach expert performance on some tasks

02 · Embeddings capture limited long-range genomic information

03 · BEND provides a realistic assessment of model capabilities

Abstract

The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

- The overarching motivation behind creating BEND, which emphasizes understanding the genome across longer ranges, is commendable.
- The efforts to collect the benchmark dataset are commendable, and the dedication shown in running benchmarks for so many language models and supervised baselines is admirable.

Weaknesses

- However, I feel that the task formulation for long sequences lacks depth and isn't entirely persuasive.
- When benchmarking DNA Language models, it's crucial to explore the intricacies of training at least one of these models from scratch. This would provide a comprehensive insight into their potential and the limitations of pretraining.
- A side-by-side comparison with a DNA Language model trained from scratch is essential. Such an analysis would give a more rounded perspective on the strengt

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

I really liked the section that provided biological background -- it was clear and concise. The benchmark tasks are well described, and each one is important for a DNA language model to be able to address. Some of these tasks are more challenging than the ones used in previous studies.

Weaknesses

Including only tasks from the human genome is problematic. It seems clear that a good language model of DNA should cover more than just one species. The most recent competing benchmark (Gresova 2023) includes eight tasks from three different species. This benchmark does not improve very much over the Gresova benchmark published this year.

Minor: It would be good to point out, in Section 2.1, that these descriptions are about eukaryotic genomes. Introduce "secondary structure" before using

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- While there exist other recent benchmark datasets for DNA LMs, the authors' datasets are complementary to the existing ones. In particular, the tasks are more realistic than those in the "Genomic benchmarks" paper (but see some caveats below).
- The authors provide a comprehensive comparison of existing DNA LM performance on the benchmark datasets; they also provide a comparison to a trained-from-scratch simple supervised model.

Weaknesses

- The supervised learning baselines are useful, but are rather weak. For example, in the context of histone modifications, a model such as Sei which is trained on much larger datasets has achieved much higher accuracy (although the results are not strictly comparable). I expect such a model to perform much better than the DNA LM approach.
- Regarding the enhancer prediction task: This is a case where additional forms of data such as DNA accessibility and DNA contacts are able to

Code & Models

Repositories

frederikkemarin/bend
pytorch · Official

Videos

BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks · slideslive

Taxonomy

Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies