PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

Jiatong Li; Renjun Hu; Kunzhe Huang; Yan Zhuang; Qi Liu; Mengxiao Zhu; Xing Shi; Wei Lin

arXiv:2405.19740·cs.CL·October 21, 2024

TL;DR

PertEval is a toolkit that uses knowledge-invariant perturbations to more accurately assess the true knowledge capacity of large language models, revealing overestimations in current benchmarks and exposing models' weaknesses.

Contribution

We introduce PertEval, a novel probing toolkit employing human-like restatement perturbations to evaluate LLMs' genuine knowledge, reducing bias from test scenario limitations and data contamination.

Findings

01 LLMs' performance is significantly overestimated on raw benchmarks.

02 PertEval reveals models' uncertainty and rote memorization tendencies.

03 Response consistency analysis uncovers weaknesses in LLMs' knowledge mastery.

Abstract

Expert-designed close-ended benchmarks are indispensable in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of response consistency analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six representative LLMs are re-evaluated using PertEval. Results…
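The two ideas in the abstract, perturbing a question without changing the knowledge it tests and then comparing raw vs. perturbed results, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not enumerate the actual perturbations or metrics, so option shuffling stands in as one plausible knowledge-invariant perturbation, and the names `MCQ`, `shuffle_options`, and `consistency_report` are hypothetical, not part of the PertEval codebase.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question; `answer` indexes into `options`."""
    stem: str
    options: tuple[str, ...]
    answer: int


def shuffle_options(q: MCQ, rng: random.Random) -> MCQ:
    """Hypothetical perturbation: reorder the options and track the key.

    The stem and the option texts are untouched, so the knowledge needed
    to answer is unchanged; only the position of each option moves.
    """
    order = list(range(len(q.options)))
    rng.shuffle(order)
    return MCQ(
        stem=q.stem,
        options=tuple(q.options[i] for i in order),
        answer=order.index(q.answer),  # new position of the correct option
    )


def consistency_report(
    raw_correct: list[bool], pert_correct: list[bool]
) -> dict[str, float]:
    """Illustrative comparison of per-question correctness on both sets.

    `consistent_acc` counts a question only if the model is right on the
    raw version and on the perturbed version (an assumed metric).
    """
    if len(raw_correct) != len(pert_correct) or not raw_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(raw_correct)
    both = sum(r and p for r, p in zip(raw_correct, pert_correct))
    return {
        "raw_acc": sum(raw_correct) / n,
        "perturbed_acc": sum(pert_correct) / n,
        "consistent_acc": both / n,
    }
```

In this sketch `consistent_acc` can never exceed either of the other two accuracies, so the gap between `raw_acc` and `consistent_acc` gives a rough measure of how many correct raw answers fail to survive the perturbation.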

Code & Models

Repositories

aigc-apps/perteval (Official)

Videos

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations · SlidesLive

Taxonomy

Topics: Data Stream Mining Techniques · Semantic Web and Ontologies · Imbalanced Data Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections