TL;DR
This paper introduces MedSkillAudit, a domain-specific framework for auditing the release readiness of medical research agent skills. In a preliminary evaluation of 75 skills, its agreement with expert review (ICC(2,1) = 0.449) exceeded the agreement between the two human experts (ICC = 0.300).
Contribution
Developed and preliminarily validated a domain-specific audit framework for medical research agent skills, enhancing assessment reliability and safety before deployment.
Findings
MedSkillAudit achieved ICC(2,1) = 0.449, surpassing human inter-rater ICC of 0.300.
System scores diverged from the expert consensus less than the two experts diverged from each other.
Agreement was strong in the protocol design category, while the academic writing category showed a system-expert mismatch.
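For readers unfamiliar with the statistic, ICC(2,1) is the two-way random-effects, absolute-agreement, single-rater intraclass correlation in the Shrout and Fleiss convention. The following is a minimal sketch of how it can be computed from a subjects-by-raters score matrix. The function name `icc_2_1` and the toy scores are illustrative assumptions; this is not the authors' code or data.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1) for an (n_subjects, k_raters) matrix of scores.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # mean square, subjects
    msc = ss_cols / (k - 1)              # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy data (made up): rater B scores every skill exactly one point above rater A.
toy = [[1, 2], [2, 3], [3, 4]]
print(round(icc_2_1(toy), 4))
```

Because ICC(2,1) measures absolute agreement, a constant offset between raters lowers the value even when their rankings are identical, which is why the toy example does not reach 1.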
Abstract
Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit, a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using…
