# EC2Seq2Sql: Patient-trial matching with LLM agents

**Authors:** Liu Yang, Yongzhong Han, Liang Liu, Xiaoyan Jiang, Ying Li, Jihan Huang, Qianmin Su

PMC · DOI: 10.1371/journal.pone.0341827 · PLOS One · 2026-02-12

## TL;DR

EC2Seq2Sql is a two-stage framework that automatically converts clinical trial eligibility criteria into SQL queries to identify eligible patients from EHR data, with the aim of reducing manual screening effort.

## Contribution

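A two-stage framework: a BART-based semantic parser first converts narrative eligibility criteria into structured pattern sequences, and an LLM agent then grounds those patterns to the target database schema and generates executable SQL for EHR-based patient screening.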

## Key findings

- The BART parser achieved ROUGE_L 0.8067 and BLEU 0.8427, outperforming Seq2Seq baselines.
- The SQL generation stage reached an exact-match accuracy of 0.84 and an execution accuracy of 0.91 after SQL normalization.
- On a de-identified real-world hepatocellular carcinoma EHR cohort from Zhongshan Hospital, Fudan University, the generated queries achieved a clinical match accuracy of 0.88 after expert review.
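
The difference between the two SQL metrics above can be illustrated with a minimal sketch. It assumes exact match means string equality after a simple normalization, and execution accuracy means both queries return the same rows. This summary does not describe the paper's actual normalization procedure, so `normalize_sql`, the toy `patient` table, and the example queries are all hypothetical.

```python
import re
import sqlite3
from collections import Counter


def normalize_sql(sql: str) -> str:
    """Toy normalization (an assumption, not the paper's procedure):
    trim, drop a trailing semicolon, collapse whitespace, lowercase.
    Note that lowercasing would also alter string literals."""
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)
    return sql.lower()


def exact_match(pred: str, gold: str) -> bool:
    """True when the two queries are identical after normalization."""
    return normalize_sql(pred) == normalize_sql(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True when both queries return the same multiset of rows.
    A predicted query that fails to execute counts as a miss."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical three-patient table used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, age INTEGER)")
    conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, 17), (2, 18), (3, 64)])
    return conn


conn = build_demo_db()
gold = "SELECT patient_id FROM patient WHERE age >= 18;"
pred_same = "select patient_id\n  from patient where age >= 18"
pred_equiv = "SELECT patient_id FROM patient WHERE age > 17"

print(exact_match(pred_same, gold))             # True: differs only in case and whitespace
print(exact_match(pred_equiv, gold))            # False: different text
print(execution_match(conn, pred_equiv, gold))  # True: same rows on this data
```

A query can miss exact match yet still return the right rows, which is consistent with execution accuracy (0.91) exceeding exact-match accuracy (0.84).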

## Abstract

Timely identification of patients who meet clinical trial eligibility criteria is a persistent bottleneck in trial recruitment because the criteria are written in flexible natural language, while hospital EHRs are stored in structured schemas. To bridge this gap, we propose EC2Seq2Sql, an end-to-end, two-stage framework that automatically converts narrative eligibility criteria into executable SQL queries for EHR-based patient screening. In the first stage, a BART-based semantic parser transforms free-text trial criteria into lightweight structured pattern sequences defined over seven common clinical domains. In the second stage, an LLM-based agent, guided by system- and human-designed prompts, grounds these structured patterns to the target database schema and generates syntactically valid and logically coherent SQL statements. We evaluated the framework on the ClinicalTrials.gov eligibility-criteria dataset and further validated it on a de-identified real-world hepatocellular carcinoma EHR cohort from Zhongshan Hospital, Fudan University. The BART parser outperformed representative Seq2Seq baselines, achieving ROUGE_L 0.8067 and BLEU 0.8427, while the SQL generation stage reached an exact-match accuracy of 0.84 and an execution accuracy of 0.91 after SQL normalization. On the real-world cohort, the generated queries achieved a clinical match accuracy of 0.88 after expert review, indicating that the proposed pipeline can retrieve trial-eligible patients from operational EHR data. These results suggest that EC2Seq2Sql can substantially reduce manual screening effort and provide a reproducible path from narrative criteria to database-level cohort identification, although broader multi-center validation and ontology-based normalization will be needed for large-scale deployment.
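
The second stage, grounding structured patterns to a database schema, can be sketched in a few lines. This is not the authors' method: the paper uses an LLM agent guided by prompts, whereas the sketch below swaps in a deterministic template purely to show what grounding a pattern to a schema means. The `Pattern` fields, the domain names, `SCHEMA_MAP`, and the toy tables are all hypothetical, and the criteria are assumed to combine as a conjunction.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical structured pattern, as a stage-one parser might emit."""
    domain: str      # e.g. "demographic" or "lab" (illustrative names only)
    attribute: str
    operator: str
    value: float


# Hypothetical mapping from (domain, attribute) to (table, column).
SCHEMA_MAP = {
    ("demographic", "age"): ("patient", "age"),
    ("lab", "alt"): ("lab_result", "alt_u_per_l"),
}
ALLOWED_OPERATORS = {"<", "<=", "=", ">=", ">"}


def pattern_to_sql(p: Pattern) -> tuple[str, tuple]:
    """Render one pattern as a parameterized query over the mapped column."""
    if p.operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {p.operator!r}")
    table, column = SCHEMA_MAP[(p.domain, p.attribute)]
    # Table and column come from the fixed map; the value is bound as a parameter.
    return f"SELECT patient_id FROM {table} WHERE {column} {p.operator} ?", (p.value,)


def screen(conn: sqlite3.Connection, patterns: list[Pattern]) -> set[int]:
    """Return patient IDs satisfying every pattern (assumed conjunction).
    An empty pattern list yields an empty set in this sketch."""
    eligible = None
    for p in patterns:
        sql, params = pattern_to_sql(p)
        ids = {row[0] for row in conn.execute(sql, params)}
        eligible = ids if eligible is None else eligible & ids
    return eligible or set()


# Toy data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, age INTEGER)")
conn.execute("CREATE TABLE lab_result (patient_id INTEGER, alt_u_per_l REAL)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, 17), (2, 45), (3, 70)])
conn.executemany("INSERT INTO lab_result VALUES (?, ?)", [(1, 30.0), (2, 35.0), (3, 120.0)])

criteria = [
    Pattern("demographic", "age", ">=", 18),
    Pattern("lab", "alt", "<=", 40),
]
print(screen(conn, criteria))  # {2}
```

Restricting operators to a whitelist and binding values as parameters keeps the generated SQL syntactically valid by construction. An LLM-generated query offers no such guarantee, which may be one reason the abstract reports execution accuracy separately.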

## Linked entities

- **Diseases:** hepatocellular carcinoma (MONDO:0007256)

## Full-text entities

- **Diseases:** hepatocellular carcinoma (MESH:D006528)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12900307/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12900307/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12900307/full.md

---
Source: https://tomesphere.com/paper/PMC12900307