MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Yingyong Hou; Xinyuan Lao; Huimei Wang; Qianyu Yao; Wei Chen; Bocheng Huang; Fei Sun; Yuxian Lv; Weiqi Lei; Xueqian Wen; Pengfei Xia; Zhujun Tan; Shengyang Xie

arXiv:2604.20441·cs.AI·April 23, 2026

TL;DR

This paper introduces MedSkillAudit, a domain-specific framework for auditing the release readiness of medical research agent skills. In a preliminary evaluation, its agreement with expert review (ICC(2,1) = 0.449) exceeded the agreement between the human experts themselves (ICC = 0.300).

Contribution

Developed and preliminarily validated a domain-specific audit framework for medical research agent skills, enhancing assessment reliability and safety before deployment.

Findings

01. MedSkillAudit achieved ICC(2,1) = 0.449, surpassing the human inter-rater ICC of 0.300.

02. System scores diverged from the expert consensus less than the two experts diverged from each other.

03. The protocol design category showed strong system-expert agreement, while the academic writing category revealed a mismatch.
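For readers unfamiliar with the statistic in the first finding, the following is a minimal sketch of ICC(2,1) in its standard form (two-way random effects, absolute agreement, single rater), computed from ANOVA mean squares. It is not the authors' code; the function name `icc_2_1` and the toy ratings are assumptions made for illustration only.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per rated item (e.g. a skill) and one column per
    rater. Hypothetical helper, not taken from the paper.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                      # n items, k raters
    grand = x.mean()
    row_means = x.mean(axis=1)          # per-item means
    col_means = x.mean(axis=0)          # per-rater means

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((x - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-item mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy ratings (6 items x 4 raters), not data from the paper.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(toy), 2))  # prints 0.29
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently more lenient or more strict lowers the value even when the rank ordering of items is identical.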

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using…

Code & Models

Repositories

aipoch/medical-research-skills (GitHub)