TL;DR
GlitchProber is a novel tool that detects and mitigates glitch tokens in large language models, improving their reliability by analyzing attention patterns and intermediate layer features.
Contribution
This work introduces GlitchProber, a new method combining sampling, PCA, and classification to efficiently identify and rectify glitch tokens in LLMs.
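The sampling, PCA and classification pipeline can be illustrated with a small sketch. It makes several assumptions that this page does not confirm. Each vocabulary token is assumed to already have a feature vector, a stand-in for statistics drawn from attention patterns and intermediate layers. A hypothetical `is_glitch_oracle` labels only the sampled tokens, standing in for an expensive check against the model. A plain logistic regression is used as the classifier, which may differ from the one the authors use. The data is synthetic.

```python
import numpy as np

def pca_fit(x, k):
    """Top-k principal directions of x via SVD of the mean-centered data."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]  # rows of vt are directions sorted by variance

def pca_transform(x, mean, components):
    """Project rows of x onto the retained principal directions."""
    return (x - mean) @ components.T

def train_logreg(z, y, lr=0.1, epochs=500):
    """Batch gradient-descent logistic regression (stand-in classifier)."""
    w = np.zeros(z.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))  # predicted P(glitch)
        grad = p - y
        w -= lr * z.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def detect_glitch_tokens(features, is_glitch_oracle, sample_frac=0.2, k=8, seed=0):
    """Label a random sample, fit PCA + classifier on it, score every token."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    sample = rng.choice(n, size=int(sample_frac * n), replace=False)
    labels = np.array([is_glitch_oracle(i) for i in sample], dtype=float)
    mean, comps = pca_fit(features[sample], k)
    w, b = train_logreg(pca_transform(features[sample], mean, comps), labels)
    scores = pca_transform(features, mean, comps) @ w + b
    return scores > 0  # True = predicted glitch token

# Synthetic vocabulary: 10% "glitch" tokens whose features are shifted.
rng = np.random.default_rng(42)
n, d = 2000, 32
truth = rng.random(n) < 0.1
feats = rng.normal(size=(n, d))
feats[truth, :5] += 4.0
pred = detect_glitch_tokens(feats, lambda i: truth[i])
print("accuracy:", (pred == truth).mean())
```

The design choice this illustrates is that only the sampled fraction ever needs the costly labeling step. PCA then keeps the dimensions where glitch and normal tokens separate, so the classifier stays cheap to train and apply to the whole vocabulary.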
Findings
Achieves an average F1 score of 0.86 in glitch token detection
Demonstrates higher efficiency and precision than existing methods
Reduces the destructive effects of glitch tokens in multiple LLMs
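For reference, the F1 score cited above is the harmonic mean of precision and recall, with glitch tokens treated as the positive class. The sketch below also assumes that the "average" is an unweighted mean over per-model scores, which this page does not state. The inputs are toy values, not results from the paper.

```python
def f1_score(pred, truth):
    """Binary F1 = harmonic mean of precision and recall (glitch = positive)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(scores_by_model):
    """Unweighted mean of per-model F1 scores (assumed aggregation)."""
    return sum(scores_by_model.values()) / len(scores_by_model)

# Toy example over 5 tokens (1 = glitch): tp = 2, fp = 1, fn = 1.
pred = [1, 1, 0, 1, 0]
truth = [1, 0, 0, 1, 1]
print(f1_score(pred, truth))  # precision = recall = 2/3, so F1 = 2/3
```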
Abstract
Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model…
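The abstract ties glitch tokens to intermediate-layer values that deviate from the normal-token distribution. One simple way to act on that observation is to pull deviating dimensions back toward the normal range. The sketch below does this with per-dimension z-score clamping. This is a hypothetical rectification strategy chosen for illustration, not necessarily the mitigation the paper uses. The threshold `z_max` and the helper names are assumptions, and the activations are synthetic.

```python
import numpy as np

def fit_normal_stats(normal_acts):
    """Per-dimension mean/std of activations collected from known-normal tokens."""
    mean = normal_acts.mean(axis=0)
    std = normal_acts.std(axis=0) + 1e-6  # guard against zero variance
    return mean, std

def rectify(acts, mean, std, z_max=3.0):
    """Clamp any dimension lying more than z_max std devs from the normal mean."""
    z = (acts - mean) / std
    return mean + np.clip(z, -z_max, z_max) * std

rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 16))     # synthetic "normal token" activations
mean, std = fit_normal_stats(normal)

glitch = rng.normal(size=16)
glitch[[2, 7]] = [12.0, -9.0]            # two strongly deviating dimensions
fixed = rectify(glitch, mean, std)
print(np.round(fixed[[2, 7]], 2))
```

In a real model, a function like `rectify` would have to run on a chosen layer's output during the forward pass, for example inside a PyTorch `register_forward_hook` callback. Which layers and dimensions to adjust is left open here.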
Taxonomy
Methods: Softmax · Attention Is All You Need
