# Benchmarking large language models for identifying transcription factor regulatory interactions

**Authors:** Lake Noel, Yi-Wen Hsiao, Yimeng He, Andrew Hung, Xiaojiang Cui, Edward Ray, Jason H Moore, Pei-Chen Peng, Xiuzhen Huang

PMC · DOI: 10.1093/bioinformatics/btaf653 · Bioinformatics · 2025-12-12

## TL;DR

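This paper benchmarks four large language model chatbots on identifying how transcription factors regulate their target genes. Claude 3.5 Sonnet and GPT-4o scored highest, and multi-turn prompting and zero-temperature settings generally helped, suggesting that general-purpose LLMs can support biologists without extensive computational expertise.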

## Contribution

The study introduces a benchmarking framework for evaluating LLMs on identifying TF–target interactions, using literature-curated (TRRUST v2) and experimentally derived (TFLink) human datasets.

## Key findings

- Claude 3.5 Sonnet and GPT-4o achieved the highest balanced accuracies in identifying TF–target interactions.
- Multi-turn prompting improved performance for most models, raising Claude 3.5 Sonnet’s accuracy on self-regulated pairs by 32.6%.
- The processed data and analytical pipeline are publicly available as an interactive Python notebook for reproducibility.

## Abstract

Transcription factors (TFs) and their target genes form regulatory networks that control gene expression and influence diverse biological processes and disease outcomes. Although multiple computational methods and curated databases have been developed to identify TF–target interactions, they often require specialized expertise. Large language model (LLM) chatbots offer a more accessible alternative for querying TF–target interactions. In this study, we benchmarked four prominent LLMs, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.0 Pro, OpenAI’s GPT-4o, and Meta’s Llama3 8b, using 8432 literature-curated human TF–target interactions. We examined four regulatory categories: bidirectional, ambiguous, self-regulated, and unidirectional interactions.

Under single-turn queries, Claude 3.5 Sonnet and GPT-4o outperformed the others, with balanced accuracies reaching 50.0 ± 7.6% (GPT-4o, self-regulated) and 48.2 ± 1.0% (Claude 3.5 Sonnet, unidirectional). Zero-temperature settings generally enhanced reproducibility, and multi-turn prompting improved performance for most models, increasing Claude 3.5 Sonnet’s accuracy on self-regulated pairs by 32.6%. Excluding TF–target pairs for which all regulation types were unknown also generally improved accuracy, with unidirectional regulation reaching nearly 70% balanced accuracy in some cases. We also benchmarked Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 2.0 Flash, OpenAI’s GPT-4o, and Meta’s Llama3 using 5148 experimentally derived TF–target interactions. Claude 3.5 Sonnet consistently outperformed the other models across conditions. Our findings highlight that prompt engineering and strategic use of model parameters consistently influence LLM chatbots’ performance on TF–target identification. This study establishes a benchmarking framework and demonstrates the potential of pre-trained general-purpose LLMs to support regulatory biology research, especially for researchers without extensive computational expertise.

The literature-based TF–target interaction ground truth was obtained from the TRRUST v2 human dataset (www.grnpedia.org/trrust). The experimentally derived TF–target interaction ground truth was obtained from the TFLink Homo sapiens small-scale interaction table (https://tflink.net/). The processed TF–target interaction data and the analytical pipeline have been compiled as an interactive Python notebook file and are available at https://github.com/pengpclab/LLM-TF-interactions.
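The abstract reports results as balanced accuracy, which is commonly defined as the mean of per-class recall. The following is a minimal sketch of that standard definition, not the authors' pipeline: the function name, the `Activation`/`Repression` labels, and the toy data are assumptions made for illustration, and the paper's exact label set and scoring rules are not given in this summary.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled example is required")

    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (illustrative only, not data from the paper).
truth = ["Activation", "Activation", "Activation", "Repression"]
always_activation = ["Activation"] * 4
mixed = ["Activation", "Activation", "Repression", "Repression"]

print(balanced_accuracy(truth, always_activation))
print(balanced_accuracy(truth, mixed))
```

In this toy case, a model that always answers `Activation` has a plain accuracy of 75% but a balanced accuracy of 50%, because it never recovers the minority class. This is why balanced accuracy is often preferred when classes are imbalanced.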

## Full-text entities

- **Genes:** F3 (coagulation factor III, tissue factor) [NCBI Gene 2152] {aka CD142, TF, TFA}
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12766914/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12766914/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12766914/full.md

---
Source: https://tomesphere.com/paper/PMC12766914