GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

Zhibo Zhang; Wuxia Bai; Yuxi Li; Mark Huasong Meng; Kailong Wang; Ling Shi; Li Li; Jun Wang; Haoyu Wang

arXiv:2408.04905·cs.CL·September 24, 2024

TL;DR

GlitchProber is a novel tool that detects and mitigates glitch tokens in large language models, improving their reliability by analyzing attention patterns and intermediate layer features.

Contribution

This work introduces GlitchProber, a new method combining sampling, PCA, and classification to efficiently identify and rectify glitch tokens in LLMs.
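The three stages named above (sampling, PCA, classification) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the feature vectors are synthetic random numbers rather than real attention patterns or intermediate-layer activations, only the first principal component is kept, and a nearest-centroid rule stands in for whatever classifier the authors actually use. The helper names (`top_component`, `project`, `fake_features`, `predict`) are hypothetical and do not come from the GlitchProber repository.

```python
import random

def top_component(rows, iters=100):
    """Return (mean, unit vector) of the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        # Apply the (unnormalized) covariance matrix implicitly: X^T (X v).
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, v):
    """Project one feature vector onto the principal direction."""
    return sum((row[j] - mean[j]) * v[j] for j in range(len(v)))

random.seed(0)
DIM = 8  # assumed feature dimension, chosen arbitrarily for the toy example

def fake_features(is_glitch):
    # Synthetic stand-in for per-token features: glitch tokens are shifted.
    shift = 4.0 if is_glitch else 0.0
    return [random.gauss(shift, 1.0) for _ in range(DIM)]

# Toy "vocabulary": 40 glitch tokens (label 1) and 160 normal tokens (label 0).
labels = [1] * 40 + [0] * 160
features = [fake_features(bool(y)) for y in labels]

# Step 1: sample a small labeled subset of the vocabulary.
indices = list(range(len(labels)))
random.shuffle(indices)
glitch_idx = [i for i in indices if labels[i] == 1]
normal_idx = [i for i in indices if labels[i] == 0]
train = glitch_idx[:10] + normal_idx[:40]
held_out = glitch_idx[10:] + normal_idx[40:]

# Step 2: fit PCA (first component only) on the sampled tokens.
mean, v = top_component([features[i] for i in train])
z = {i: project(features[i], mean, v) for i in indices}

# Step 3: classify the remaining tokens by nearest class centroid in PCA space.
c_glitch = sum(z[i] for i in train if labels[i] == 1) / 10
c_normal = sum(z[i] for i in train if labels[i] == 0) / 40

def predict(i):
    return 1 if abs(z[i] - c_glitch) < abs(z[i] - c_normal) else 0

accuracy = sum(predict(i) == labels[i] for i in held_out) / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The point of sampling first is that only the small labeled subset needs ground-truth glitch labels; the reduced representation and the classifier are then applied to the rest of the vocabulary. The synthetic classes here are deliberately well separated, so the accuracy printed says nothing about real models.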

Findings

01 Achieves an average F1 score of 0.86 in glitch token detection

02 Demonstrates higher efficiency and precision than existing methods

03 Reduces destructive effects of glitch tokens in multiple LLMs
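For readers unfamiliar with the metric in the first finding, F1 is the harmonic mean of precision and recall. The short sketch below computes it from confusion counts; the counts in the usage line are hypothetical numbers chosen so that the result equals 0.86, and they are not taken from the paper's experiments.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 43 glitch tokens found, 7 false alarms, 7 missed.
example = f1_score(tp=43, fp=7, fn=7)
print(round(example, 2))
```

Because F1 penalizes both false alarms and misses, a detector cannot reach a high score by flagging every token or by flagging almost none.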

Abstract

Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model…

Code & Models

Repositories

llm-integrity-guard/glitchprober
jax · Official

Taxonomy

Methods: Softmax · Attention Is All You Need