Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing
Zonghai Yao, Yi Cao, Zhichao Yang, Hong Yu

TL;DR
This paper introduces context variance prompts and a new evaluation metric to improve the reliability of probing biomedical knowledge in pretrained language models, addressing biases and challenges in existing methods.
Contribution
The paper proposes a context variance approach to prompt generation and the rank-change-based UCM metric, which together improve the evaluation of PLMs' biomedical knowledge, especially for large-N-M and rare relations.
Findings
Context variance prompts improve the robustness of knowledge probing.
The UCM metric captures model understanding beyond simple recall.
Evaluation is more stable for large-N-M and rare relations.
Abstract
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blank problems (e.g., cloze tests) are a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate the knowledge of different PLMs. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors, such as prompt-based probing biases, make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA: the severe long-tailed vocabulary distribution and large-N-M relations leave a notable performance gap between LAMA and BioLAMA. To address these issues, we introduce context variance into the prompt generation and propose a new rank-change-based evaluation metric. Different from the…
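The Top-k accuracy metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes a prompt counts as a hit when any of its k highest-ranked predictions appears in that prompt's set of gold answers, which accommodates N-M relations where one subject has several valid objects. The function name, the data layout, and the toy prompts are hypothetical and are not taken from the paper or the BioLAMA code.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    predictions: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of prompts whose top-k ranked predictions contain a gold answer.

    predictions maps each cloze prompt to candidates sorted best-first.
    gold maps each prompt to the set of acceptable answers (N-M relations
    can have several). Both layouts are assumptions made for this sketch.
    """
    if not predictions:
        return 0.0
    hits = 0
    for prompt, ranked in predictions.items():
        answers = gold.get(prompt, set())
        # A hit requires at least one gold answer among the first k candidates.
        if any(candidate in answers for candidate in ranked[:k]):
            hits += 1
    return hits / len(predictions)


# Toy, hand-written data for illustration only (not model output).
toy_predictions = {
    "Aspirin treats [MASK].": ["pain", "headache", "inflammation"],
    "Metformin treats [MASK].": ["obesity", "diabetes", "type 2 diabetes"],
}
toy_gold = {
    "Aspirin treats [MASK].": {"pain", "fever"},
    "Metformin treats [MASK].": {"type 2 diabetes"},
}

acc_at_1 = top_k_accuracy(toy_predictions, toy_gold, k=1)
acc_at_3 = top_k_accuracy(toy_predictions, toy_gold, k=3)
```

In this toy setup the second prompt is only answered correctly at rank 3, so the score depends heavily on k. A metric of this kind records only whether a gold answer appears near the top of the ranking for a given prompt.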
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Healthcare
Methods: Tanh Activation · Softmax · Low-Rank Factorization-based Multi-Head Attention
