Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data

Mohammadreza Ghaffarzadeh-Esfahani; Mahdi Ghaffarzadeh-Esfahani; Arian Salahi-Niri; Hossein Toreyhi; Zahra Atf; Amirali Mohsenzadeh-Kermani; Mahshad Sarikhani; Zohreh Tajabadi; Fatemeh Shojaeian; Mohammad Hassan Bagheri; Aydin Feyzi; Mohammadamin Tarighatpayma; Narges Gazmeh; Fateme Heydari; Hossein Afshar; Amirreza Allahgholipour; Farid Alimardani; Ameneh Salehi; Naghmeh Asadimanesh; Mohammad Amin Khalafi; Hadis Shabanipour; Ali Moradi; Sajjad Hossein Zadeh; Omid Yazdani; Romina Esbati; Moozhan Maleki; Danial Samiei Nasr; Amirali Soheili; Hossein Majlesi; Saba Shahsavan; Alireza Soheilipour; Nooshin Goudarzi; Erfan Taherifard; Hamidreza Hatamabadi; Jamil S Samaan; Thomas Savage; Ankit Sakhuja; Ali Soroush; Girish Nadkarni; Ilad Alavi Darazam; Mohamad Amin Pourhoseingholi; Seyed Amir Ahmad Safavi-Naini

arXiv:2409.02136·cs.LG·April 10, 2026·2 cites

TL;DR

This study compares classical machine learning models and large language models in predicting COVID-19 mortality from high-dimensional tabular data. Classical models perform best, zero-shot LLMs lag well behind, and fine-tuning narrows the gap.

Contribution

It demonstrates that fine-tuning LLMs can significantly improve their performance, but classical models still outperform LLMs on structured data tasks.

Findings

01

XGBoost and random forest (RF) were the best classical models, with F1 scores of 0.87 on internal validation and 0.83 on external validation.

02

GPT-4 achieved an F1 score of 0.43 in zero-shot classification.

03

Fine-tuning Mistral-7b increased recall from 1% to 79%, with a stable F1 of 0.74 during external validation.
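The jump in recall matters because F1 is the harmonic mean of precision and recall, so a very low recall caps F1 no matter how precise the model is. The sketch below is illustrative arithmetic only, not the authors' evaluation code, and the helper names are hypothetical. It computes the best F1 reachable at a given recall by assuming perfect precision.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def max_f1_for_recall(recall: float) -> float:
    """Upper bound on F1 at a fixed recall, i.e. assuming precision = 1.0."""
    return f1_score(1.0, recall)


# Recall values quoted in the findings: 1% before and 79% after fine-tuning.
for recall in (0.01, 0.79):
    print(f"recall={recall:.2f} -> F1 can be at most {max_f1_for_recall(recall):.4f}")
```

At 1% recall the ceiling on F1 is about 0.02, so no precision gain could have rescued the zero-shot model. At 79% recall the ceiling is about 0.88, which leaves room for the reported F1 of 0.74.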

Abstract

This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral-7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in…
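The abstract mentions zero-shot classification on text-converted structured data. One way to picture that step is the minimal sketch below. It is an assumption-laden illustration, not the authors' pipeline: the feature names, the prompt wording, and the helpers `row_to_prompt` and `parse_label` are all hypothetical, and the call to the LLM itself is left out.

```python
from typing import Mapping, Optional


def row_to_prompt(row: Mapping[str, Optional[object]]) -> str:
    """Serialize one tabular record into a plain-text zero-shot prompt.

    Missing values (None) are skipped rather than written out.
    """
    facts = "; ".join(f"{name}: {value}" for name, value in row.items()
                      if value is not None)
    return (
        f"Patient record: {facts}. "
        "Question: will this patient die in hospital? Answer 'yes' or 'no'."
    )


def parse_label(reply: str) -> int:
    """Map a free-text model reply to a binary label (1 = predicted death)."""
    return 1 if reply.strip().lower().startswith("yes") else 0


# Hypothetical example row; these feature names are not from the paper.
example = {"age": 71, "oxygen_saturation": 88, "diabetes": "yes", "crp": None}
print(row_to_prompt(example))
print(parse_label(" Yes, high risk."))
```

In this form each model reply becomes a 0/1 prediction, so it can be scored with the same metrics (precision, recall, F1) as a classical classifier trained on the original columns.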
