BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri

TL;DR
BiomedSQL introduces a challenging biomedical reasoning benchmark for text-to-SQL systems, highlighting current limitations and providing a foundation for future research in scientific question answering over knowledge bases.
Contribution
We present BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL over biomedical knowledge bases, with a large dataset and evaluation of state-of-the-art models.
Findings
Models perform significantly below the expert baseline.
Current models struggle with domain-specific reasoning.
BiomedSQL reveals gaps in reasoning capabilities of existing systems.
Abstract
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We…
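To make the idea of an implicit domain criterion concrete, here is a minimal sketch of how a question such as "Is GENE_A significantly associated with trait_x?" might translate into SQL. The word "significantly" never names a number, so the query has to supply one. The sketch assumes the conventional genome-wide significance threshold of p < 5e-8 and uses an in-memory SQLite table in place of the BigQuery knowledge base. The table name `gwas_associations`, its columns, and all rows are hypothetical and are not taken from the benchmark.

```python
import sqlite3

# Conventional genome-wide significance threshold (an assumption here;
# the benchmark's exact criteria are not given in this summary).
GENOME_WIDE_P = 5e-8


def build_demo_db():
    """Create a tiny in-memory table with made-up association rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gwas_associations ("
        "variant_id TEXT, gene TEXT, trait TEXT, p_value REAL, beta REAL)"
    )
    conn.executemany(
        "INSERT INTO gwas_associations VALUES (?, ?, ?, ?, ?)",
        [
            ("rs_demo_1", "GENE_A", "trait_x", 3e-10, 0.12),
            ("rs_demo_2", "GENE_A", "trait_x", 2e-5, -0.04),
            ("rs_demo_3", "GENE_B", "trait_x", 1e-9, -0.20),
        ],
    )
    return conn


def significant_associations(conn, gene, trait):
    """Return rows that pass the threshold the question only implies."""
    sql = """
        SELECT variant_id, p_value, beta
        FROM gwas_associations
        WHERE gene = ? AND trait = ? AND p_value < ?
        ORDER BY p_value
    """
    return conn.execute(sql, (gene, trait, GENOME_WIDE_P)).fetchall()


def effect_direction(beta):
    """Map the sign of an effect size to a plain-language direction."""
    return "increases" if beta > 0 else "decreases"


conn = build_demo_db()
rows = significant_associations(conn, "GENE_A", "trait_x")
directions = [effect_direction(beta) for _, _, beta in rows]
```

A purely syntactic translation would omit the `p_value < ?` filter and also return `rs_demo_2`, which is the kind of error the abstract describes.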
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
* The paper addresses an impactful aspect of structured reasoning in science, specifically the ability to use implicit, domain-specific knowledge.
* The benchmark is built on a "real-world biomedical knowledge base", harmonizing large, authentic public datasets such as OpenTargets, ChEMBL, and GWAS Catalog data. This provides a more realistic challenge than many existing text-to-SQL corpora.
* The authors assess performance by reasoning type (like aggregation or multi-table joins) and
* The entire dataset of 68,000 questions is generated from only 40 expert-written templates, which implies high homogeneity.
* The paper introduces and uses BioScore, an LLM-as-a-judge metric (using GPT-4o), to evaluate natural language answer quality. Although the authors provide a validation study showing high correlation with a domain expert, this does not capture the full picture of the metric's reliability, as the paper itself acknowledges the "concern over LLM-as-a-judge metrics yielding unstab
- First benchmark explicitly targeting scientific reasoning in text-to-SQL for biomedicine, going beyond syntactic translation in general benchmarks or clinical ones.
- Highlights implicit domain conventions like significance thresholds, effect directionality, and multi-omic causal inference, which are critical for real-world biomedical queries but underexplored in prior work.
- Large number of data samples with 68,000 triples and a large-scale database (e.g., 21M+ rows in GWAS tables).
- Thorou
- Generating 68K samples from 40 seed questions (drawn from CARDBiomedBench) may result in redundant patterns, potentially overestimating model generalization. Is there a specific reason for scaling to 68K rather than a smaller subset? The expansion process appears limited to entity substitution, without syntactic expansions (if I understand correctly), which could reduce real-world variability as seen in crowdsourced benchmarks.
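The entity-substitution process the reviewers describe can be sketched as follows. This is an illustrative reconstruction rather than the authors' pipeline: the template text, slot names, and entity values are hypothetical. Note that 68,000 triples from 40 templates is 1,700 instances per template on average, if instances are spread evenly.

```python
import re
from itertools import product


def expand_template(question_tpl, sql_tpl, slots):
    """Fill every combination of slot values into a question/SQL template pair."""
    names = sorted(slots)
    pairs = []
    for values in product(*(slots[name] for name in names)):
        binding = dict(zip(names, values))
        pairs.append((question_tpl.format(**binding), sql_tpl.format(**binding)))
    return pairs


# Hypothetical template and entities. String formatting is used only to
# generate dataset text; it is not a safe way to run queries on user input.
question_tpl = "Is {gene} significantly associated with {disease}?"
sql_tpl = (
    "SELECT * FROM gwas_associations "
    "WHERE gene = '{gene}' AND trait = '{disease}' AND p_value < 5e-8"
)
slots = {
    "gene": ["GENE_A", "GENE_B"],
    "disease": ["trait_x", "trait_y", "trait_z"],
}

pairs = expand_template(question_tpl, sql_tpl, slots)

# Mask the quoted literals to see how many distinct SQL structures remain.
skeletons = {re.sub(r"'[^']*'", "'?'", sql) for _, sql in pairs}
```

Here six question/SQL pairs collapse to a single SQL skeleton once literals are masked, which is the redundancy concern raised in the reviews.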
S1: New combined biomedical database that can be used by Text-to-SQL researchers.
S2: Clear motivation and paper structure.
S3: Dataset seems challenging for the evaluated Text-to-SQL approaches.
W1: Creating 68K data points from only 40 seed questions in a template-based approach will lead to very similar questions that differ only in small parts (e.g., a different value in the WHERE filter).
W2: The authors show that existing Text-to-SQL systems do not work well on their dataset, which is surprising because, with the template-based approach, the dataset contains a lot of repetition and structurally identical questions. A similarity-based few-shot approach should dramatically boost performance.
W3:
## Strengths
- **Novel and impactful problem scope** Text-to-SQL has been widely studied, but very few benchmarks target scientific reasoning or biomedical domains where implicit conventions (e.g., significance thresholds, trial phases) matter. This benchmark addresses that gap.
- **Realistic, large-scale data integration** The authors construct a multi-table biomedical schema from authentic public datasets, which lends realism far beyond toy databases used in Spider-style corpora.
-
1. **Security / artifact hygiene** The supplementary materials appear to contain active service-account credentials for BigQuery access. Even if these are limited in scope, publishing any cloud credentials is a critical security issue.
2. **Licensing and redistribution clarity** Several component datasets (e.g., ChEMBL) carry CC BY-NC 4.0 licenses that restrict commercial redistribution. The paper should explicitly list the license for each source table and explain what will be host
Taxonomy
Methods: Causal inference · Balanced Selection
