A Course Shared Task on Evaluating LLM Output for Clinical Questions

Yufang Hou; Thy Thy Tran; Doan Nam Long Vu; Yiwen Cao; Kai Li; Lukas Rohde; Iryna Gurevych

arXiv:2408.00122 · cs.CL · August 2, 2024

TL;DR

This paper introduces a course shared task on evaluating LLM output for health-related clinical questions, with a focus on harmful answers, and reports the task design considerations and student feedback for instructors teaching NLP.

Contribution

The paper presents an educational shared task on evaluating LLM output for clinical questions, covering the task design considerations and the feedback received from students.

Findings

01 Student feedback on task design

02 Insights into LLM output evaluation

03 Relevance for NLP education

Abstract

This paper presents a shared task that we organized in the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt. The task focuses on evaluating the output of Large Language Models (LLMs) for harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.

Code & Models

Repositories

UKPLab/folt-shared-task-23-24 (official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Artificial Intelligence in Law