GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding
Weicai Long, Yusen Hou, Junning Feng, Houcheng Su, Shuo Yang, Donglin Xie, Yanlin Zhang

TL;DR
GenomeQA introduces a comprehensive benchmark to evaluate general-purpose large language models on raw genome sequence inference tasks, highlighting their strengths and limitations in biological sequence understanding.
Contribution
GenomeQA provides a controlled evaluation benchmark with diverse biological tasks and sequence lengths, enabling systematic study of general-purpose LLMs on raw genome sequences.
Findings
General-purpose LLMs outperform random baselines on the sequence-based tasks.
Models exploit local signals like GC content and motifs.
Performance drops on complex, multi-step inference tasks.
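To make the "local signals" in the findings concrete, here is a minimal Python sketch of two such features: GC content (the fraction of bases that are G or C) and a simple motif count. The function names, the example sequence, and the choice of a TATA-like motif are illustrative assumptions, not taken from the paper or its benchmark code.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases that are G or C (case-insensitive).

    Returns 0.0 for an empty sequence.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    seq, motif = seq.upper(), motif.upper()
    if not motif:
        return 0
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = seq.find(motif, start + 1)
    return count


# Hypothetical 14 bp sequence, made up for illustration only.
example = "TATAAAGGCGCGCC"
print(f"GC content: {gc_content(example):.3f}")
print(f"TATA count: {count_motif(example, 'TATA')}")
```

Features like these depend only on short stretches of sequence, which may be one reason they are easier for a model to pick up than signals that require combining evidence across several inference steps.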
Abstract
Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic…
