Large Language Models for Variant-Centric Functional Evidence Mining
Ali Saadat, Jacques Fellay

TL;DR
This paper introduces a benchmark and an end-to-end pipeline that use large language models to automate and scale the curation of functional evidence for genomic variants from the literature, with the aim of improving curation accuracy and efficiency.
Contribution
It presents a new benchmark for evaluating LLMs on evidence curation tasks and develops AcmGENTIC, an end-to-end pipeline for literature retrieval and evidence extraction.
Findings
Both models (gpt-4o-mini and o4-mini) achieved high recall in abstract screening.
o4-mini achieved 96% accuracy in full-text evidence classification.
The pipeline automates literature retrieval, evidence extraction, and report generation.
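As a reading aid for the two metrics reported above, here is a minimal sketch of how recall (abstract screening) and accuracy (evidence classification) are conventionally computed against curated labels. The function names, the toy label lists, and the use of the ACMG codes PS3/BS3 as class labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def recall(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Fraction of truly relevant items that the model also flagged as relevant."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    relevant = sum(1 for g in gold if g)
    if relevant == 0:
        raise ValueError("recall is undefined when no gold item is relevant")
    true_pos = sum(1 for g, p in zip(gold, pred) if g and p)
    return true_pos / relevant


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the curated label."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Hypothetical toy labels (made up for illustration, not from the benchmark).
# Screening: does the abstract report a functional experiment on the variant?
gold_screen = [True, True, False, True, False]
pred_screen = [True, True, True, True, False]  # one false positive, no misses
screen_recall = recall(gold_screen, pred_screen)

# Classification: assumed label set of ACMG functional-evidence codes.
gold_evidence = ["PS3", "BS3", "PS3", "PS3"]
pred_evidence = ["PS3", "BS3", "BS3", "PS3"]  # one of four is wrong
evidence_accuracy = accuracy(gold_evidence, pred_evidence)
```

In this toy data the false positive in screening does not lower recall, which is why a screening step is usually tuned for recall: a missed relevant paper cannot be recovered by later full-text review, while an extra paper can still be rejected there.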
Abstract
Functional evidence is essential for clinical interpretation of genomic variants, but identifying relevant studies and translating experimental results into structured evidence remains labor-intensive. We developed a benchmark based on ClinGen curated annotations to evaluate two large language models (LLMs), a non-reasoning model (gpt-4o-mini) and a reasoning model (o4-mini), on tasks relevant to functional evidence curation: (1) abstract screening to determine whether a study reports functional experiments directly testing specific variants, and (2) full-text evidence extraction and classification from matched variant-paper pairs, including interpretation of evidence direction and generation of evidence summaries. Starting from ClinGen variants annotated with functional evidence, we processed curator comments with an LLM to extract PubMed identifiers, evidence labels, and narrative,…
