Count-Based Approaches Remain Strong: A Benchmark Against Transformer and LLM Pipelines on Structured EHR
Jifan Gao, Michael Rosenthal, Brian Wolpin, Simona Cristea

TL;DR
This study benchmarks count-based models against a pretrained transformer and a mixture-of-agents LLM pipeline for structured EHR prediction on the EHRSHOT dataset. Count-based methods remain competitive while being simpler and more interpretable.
Contribution
The paper provides a direct comparison of count-based, transformer, and mixture-of-agents LLM pipeline methods on structured EHR data, highlighting the continued strength of count-based approaches.
Findings
Count-based models perform competitively with LLM pipelines.
No single method dominates across all tasks.
Count-based models offer simplicity and interpretability.
Abstract
Structured electronic health records (EHR) are essential for clinical prediction. While count-based learners continue to perform strongly on such data, no benchmark has directly compared them against more recent mixture-of-agents LLM pipelines, which have been reported to outperform single LLMs in various NLP tasks. In this study, we evaluated three categories of methods for EHR prediction on the EHRSHOT dataset: count-based models built from ontology roll-ups with two time bins, using LightGBM and the tabular foundation model TabPFN; a pretrained sequential transformer (CLMBR); and a mixture-of-agents pipeline that converts tabular histories to natural-language summaries, which are then passed to a text classifier. Across the eight evaluation tasks, head-to-head wins were largely split between the count-based and the…
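To make the count-based representation concrete, the following is a minimal sketch of how rolled-up code counts in two time bins might be computed for one patient. It is an illustration under stated assumptions, not the authors' implementation: the `roll_up` rule (truncating a code to its first three characters), the 365-day bin boundary, and the feature naming are all hypothetical and do not come from the paper.

```python
from collections import Counter
from datetime import date


def roll_up(code: str) -> str:
    # Hypothetical roll-up rule: keep the first three characters of a code
    # (for ICD-10 this is the three-character category). The ontology
    # roll-up used in the paper may differ.
    return code[:3]


def count_features(events, prediction_date, recent_days=365):
    """Count rolled-up codes in two time bins relative to prediction_date.

    events: iterable of (code, event_date) pairs for one patient.
    recent_days: assumed boundary between the two bins (not from the paper).
    """
    counts = Counter()
    for code, event_date in events:
        age = (prediction_date - event_date).days
        if age < 0:
            continue  # ignore events after the prediction time (no leakage)
        bin_name = "recent" if age <= recent_days else "older"
        counts[f"{roll_up(code)}|{bin_name}"] += 1
    return dict(counts)


# Toy patient history (made-up events, for illustration only).
events = [
    ("E11.9", date(2023, 11, 1)),
    ("E11.65", date(2021, 3, 15)),
    ("I10", date(2023, 12, 20)),
    ("I10", date(2024, 2, 1)),  # after the prediction date, so skipped
]
features = count_features(events, prediction_date=date(2024, 1, 1))
print(features)
```

Each patient then becomes a sparse vector of counts, one column per (rolled-up code, time bin) pair, which is the kind of tabular input a gradient-boosted tree learner such as LightGBM can consume directly.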
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Electronic Health Records Systems
