NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Hexuan Deng; Wenxiang Jiao; Xuebo Liu; Min Zhang; Zhaopeng Tu

arXiv:2410.20814·cs.CL·October 29, 2024

TL;DR

This paper introduces NewTerm, an adaptive benchmark for evaluating large language models on real-time new terms. It documents the performance decline these terms cause and analyzes the challenges of updating models with recent information.

Contribution

We propose a highly automated, annually updated benchmark for real-time evaluation of new terms, addressing the limitations of existing benchmarks, which focus on outdated content.

Findings

01 LLMs experience a performance drop of over 20% on new terms

02 Knowledge cutoff updates only partially cover new terms

03 Certain types of new terms are more challenging for LLMs

Abstract

Despite their remarkable abilities in various tasks, large language models (LLMs) still struggle with real-time information (e.g., new facts and terms) due to the knowledge cutoff in their development process. However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also…

Code & Models

Repositories

hexuandeng/newterm (official)

Datasets

hexuandeng/NewTerm
dataset · 4 downloads

Videos

NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates · SlidesLive

Taxonomy

Topics: Topic Modeling

Methods: Focus