Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports

Jingwei Huang; Kuroush Nezafati; Ismael Villanueva-Miranda; Zifan Gu; Yueshuang Xu; Ann Marie Navar; Tingyi Wanyan; Qin Zhou; Bo Yao; Ruichen Rong; Xiaowei Zhan; Guanghua Xiao; Eric D. Peterson; Donghan M. Yang; Wenqi Shi; Yang Xie

arXiv:2410.16543·cs.AI·July 22, 2025

TL;DR

This paper presents a multiagent ensemble approach powered by large language models to improve data labeling accuracy and reduce hallucinations in large-scale electronic health record datasets, demonstrated on ECG reports and clinical notes.

Contribution

The study introduces an ensemble method that combines multiple LLMs to improve labeling accuracy and reduce hallucinations in EHR data. The ensemble outperforms the individual models and can be applied to other text labeling tasks.

Findings

01 Achieved 98.2% accuracy in labeling MIMIC-IV ECG reports.

02 Reduced hallucination errors compared to individual LLMs.

03 Generalized well to social determinants of health identification.

Abstract

This study introduces an LLM-powered multiagent ensemble method to address hallucination and data labeling challenges, particularly in large-scale EHR datasets. Manual labeling of such datasets requires domain expertise and is labor-intensive, time-consuming, expensive, and error-prone. To overcome this bottleneck, we developed an ensemble LLM method and demonstrated its effectiveness on two real-world tasks: (1) labeling a large-scale unlabeled ECG dataset in MIMIC-IV; (2) identifying social determinants of health (SDOH) from clinical notes in EHRs. Weighing benefits against cost, we selected a pool of diverse open-source LLMs with satisfactory performance. We treat each LLM's prediction as a vote and apply majority voting with a minimal winning threshold to form the ensemble. We implemented an ensemble LLM application for EHR data labeling tasks. By using the ensemble LLMs…
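
The voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ensemble_label`, the parameter `min_winning_votes`, the example label strings, and the choice to return `None` (abstain) on a tie or when the threshold is not met are all assumptions made for illustration. The abstract does not state the threshold value or how unresolved cases are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def ensemble_label(votes: Sequence[str], min_winning_votes: int) -> Optional[str]:
    """Combine per-model labels by majority vote with a minimal winning threshold.

    Hypothetical helper: returns the most frequent label only if it is the
    unique top choice and has at least `min_winning_votes` votes. Otherwise
    returns None, which a caller could route to manual review (an assumption,
    not something the abstract specifies).
    """
    if not votes:
        return None

    # Count votes and look at the two most frequent labels.
    ranked = Counter(votes).most_common(2)
    top_label, top_count = ranked[0]

    # A tie between the two leading labels means there is no winner.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None

    # The winner must also clear the minimal winning threshold.
    if top_count < min_winning_votes:
        return None

    return top_label


# Example with five hypothetical model outputs and an assumed threshold of 3.
votes = ["AF", "AF", "no AF", "AF", "no AF"]
print(ensemble_label(votes, min_winning_votes=3))
```

Requiring a minimum number of agreeing votes, rather than accepting any plurality, means that a label produced by only one or two models is not adopted; this is one plausible way an ensemble could filter out isolated hallucinated outputs.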


Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Semantic Web and Ontologies