BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science
Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z. Li, Kaicheng Yu

TL;DR
BioKGBench introduces a novel benchmark for evaluating AI agents in biomedical science by assessing their ability to understand scientific literature and verify facts using knowledge graphs, revealing significant gaps in current agent performance.
Contribution
The paper proposes a new benchmark, BioKGBench, that evaluates biomedical AI agents on scientific claim verification and knowledge graph question-answering, addressing limitations of existing QA-based assessments.
Findings
State-of-the-art agents perform poorly on the benchmark.
Over 90 factual errors were found in a popular knowledge graph.
The simple BKGAgent baseline shows promising results.
Abstract
Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people rely either on direct Question-Answering (QA) with the LLM itself or on biomedical experiments. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one of the most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmarks that focus only on factual QA, where LLMs are known to have hallucination issues, we first disentangle "Understanding Literature" into two atomic abilities: i) "Understanding" the unstructured text from research papers by performing scientific claim…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare and Education
