Preparing to Integrate Generative Pretrained Transformer Series 4 models into Genetic Variant Assessment Workflows: Assessing Performance, Drift, and Nondeterminism Characteristics Relative to Classifying Functional Evidence in Literature

Samuel J. Aronson (1,2), Kalotina Machini (1,3), Jiyeon Shin (2), Pranav Sriraman (1), Sean Hamill (4), Emma R. Henricks (1), Charlotte Mailly (1,2), Angie J. Nottage (1), Sami S. Amr (1,3), Michael Oates (1,2), Matthew S. Lebo (1,3) ((1) Mass General Brigham Personalized Medicine, (2) Accelerator for Clinical Transformation, Mass General Brigham, (3) Department of Pathology, Brigham and Women's Hospital, (4) Microsoft Corporation)

arXiv:2312.13521 · q-bio.GN · February 20, 2024 · 1 citation

Open Access

TL;DR

This study evaluates GPT-4's performance, variability, and stability in classifying functional evidence in genetic variant literature, highlighting the importance of monitoring nondeterminism and drift for clinical application.

Contribution

It provides an analysis of GPT-4's performance and variability over time in a clinical text classification task, informing its integration into genetic variant assessment workflows.

Findings

1. GPT-4 achieved 92.2% sensitivity in identifying articles with functional evidence.

2. Performance variability decreased after January 18, 2024.

3. Nondeterminism and drift significantly impact GPT-4's reliability in clinical tasks.
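The sensitivity figure and the nondeterminism concern in these findings can be made concrete with a small sketch. The function names, the toy labels, and the choice of "fraction of articles labeled identically in every repeated run" as a consistency measure are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def sensitivity(truth: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of truly positive articles that were also predicted positive."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    true_pos = sum(1 for t, p in zip(truth, predicted) if t and p)
    false_neg = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if true_pos + false_neg == 0:
        raise ValueError("truth contains no positive articles")
    return true_pos / (true_pos + false_neg)


def run_consistency(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of articles that receive the same label in every run."""
    if not runs or not runs[0]:
        raise ValueError("need at least one non-empty run")
    n_articles = len(runs[0])
    if any(len(run) != n_articles for run in runs):
        raise ValueError("all runs must cover the same articles")
    # zip(*runs) regroups the labels so each tuple holds one article's labels.
    stable = sum(1 for labels in zip(*runs) if len(set(labels)) == 1)
    return stable / n_articles


# Toy, made-up labels purely for illustration (not data from the paper).
toy_truth = [True, True, True, False]
toy_predicted = [True, True, False, False]
toy_runs = [
    ["pathogenic", "benign", "none"],
    ["pathogenic", "intermediate/inconclusive", "none"],
]

print(f"sensitivity: {sensitivity(toy_truth, toy_predicted):.3f}")
print(f"run consistency: {run_consistency(toy_runs):.3f}")
```

Tracking a measure like `run_consistency` on a fixed article set over calendar time is one simple way to notice drift, since a change in the score between dates cannot be explained by a change in the inputs.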

Abstract

Background. Large Language Models (LLMs) hold promise for improving genetic variant literature review in clinical testing. We assessed the performance, nondeterminism, and drift of Generative Pretrained Transformer 4 (GPT-4) to inform its suitability for use in complex clinical processes.

Methods. A 2-prompt process for classification of functional evidence was optimized using a development set of 45 articles. The first prompt asked GPT-4 to supply all functional data present in an article related to a variant or indicate that no functional evidence is present. For articles indicated as containing functional evidence, a second prompt asked GPT-4 to classify the evidence into pathogenic, benign, or intermediate/inconclusive categories. A final test set of 72 manually classified articles was used to test performance.

Results. Over a 2.5-month period (Dec 2023-Feb 2024), we observed substantial…
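The 2-prompt process described in the Methods can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the prompt wording, the `NO FUNCTIONAL EVIDENCE` sentinel, the `classify_article` function, and the `fake_model` stand-in are all hypothetical, and the paper's actual prompts and model calls are not reproduced here.

```python
from typing import Callable, Optional

# Hypothetical prompt wording, paraphrased from the abstract's description.
EXTRACT_PROMPT = (
    "List all functional data in this article related to variant {variant}, "
    "or reply exactly NO FUNCTIONAL EVIDENCE.\n\n{article}"
)
CLASSIFY_PROMPT = (
    "Classify this functional evidence as exactly one of: pathogenic, "
    "benign, intermediate/inconclusive.\n\n{evidence}"
)
CATEGORIES = ("pathogenic", "benign", "intermediate/inconclusive")


def classify_article(
    ask: Callable[[str], str], article: str, variant: str
) -> Optional[str]:
    """Run the two prompts in sequence; return None when no evidence is found."""
    # Prompt 1: extract functional evidence, or learn that there is none.
    evidence = ask(EXTRACT_PROMPT.format(variant=variant, article=article)).strip()
    if evidence.upper() == "NO FUNCTIONAL EVIDENCE":
        return None
    # Prompt 2: classify the extracted evidence into one of three categories.
    answer = ask(CLASSIFY_PROMPT.format(evidence=evidence)).strip().lower()
    if answer not in CATEGORIES:
        raise ValueError(f"unexpected category from model: {answer!r}")
    return answer


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real LLM call so the sketch is runnable."""
    if prompt.startswith("List all functional data"):
        if "no assay" in prompt:
            return "NO FUNCTIONAL EVIDENCE"
        return "Reduced activity reported in a cell-based assay."
    return "pathogenic"


print(classify_article(fake_model, "This review has no assay data.", "c.1A>G"))
print(classify_article(fake_model, "Knock-in cells lost function.", "c.1A>G"))
```

Passing the model in as a plain callable keeps the pipeline independent of any particular client library, which also makes it easy to replay the same articles against different model versions when checking for drift.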

Taxonomy

Topics: Genomics and Rare Diseases · Biomedical Text Mining and Ontologies

Methods: Sparse Evolutionary Training · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Residual Connection · Dropout · Label Smoothing