COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
Dimitrios P. Panagoulias, Persephone Papatheodosiou, Anastasios P. Palamidas, Mattheos Sanoudos, Evridiki Tsoureli-Nikita, Maria Virvou, George A. Tsihrintzis

TL;DR
This paper introduces COGNET-MD, a new evaluation framework and dataset for benchmarking Large Language Models in the medical domain, focusing on interpretative ability and safety through a challenging scoring system and expert-constructed MCQs.
Contribution
It presents a novel, domain-specific benchmark for assessing LLMs in medical contexts, combining a scoring framework with a curated MCQ dataset that spans multiple medical specialties.
Findings
Benchmark includes diverse medical domains.
MCQ dataset constructed with medical experts.
Framework emphasizes interpretative difficulty and safety.
Abstract
Large Language Models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence (AI) technology which is rapidly evolving and promises to aid in medical diagnosis, either by assisting doctors or by simulating a doctor's workflow in more advanced and complex implementations. In this technical paper, we outline the Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), which constitutes a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring framework with increased difficulty to assess the ability of LLMs to interpret medical text. The proposed framework is accompanied by a database of Multiple Choice Quizzes (MCQs). To ensure alignment with current medical trends and enhance safety, usefulness, and applicability, these MCQs have been constructed in collaboration with several associated medical experts in various…
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging
