Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

Randall Balestriero; Romain Cosentino; Sarath Shekkizhar

arXiv:2312.01648 · cs.AI · July 12, 2024 · 2 cites

TL;DR

This paper explores the internal geometry of large language models to understand their representations, enabling new methods for toxicity detection and manipulation of model outputs.

Contribution

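The paper introduces a geometric framework for analyzing LLMs. It derives in closed form the intrinsic dimension of Multi-Head Attention embeddings and the per-region affine mappings of the feedforward (MLP) layers, then applies these results to toxicity detection and to informed prompt manipulation.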

Findings

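01. The intrinsic dimension of Multi-Head Attention embeddings is characterized in closed form.

02. Geometric features extracted from a pre-trained LLM can identify and classify toxicity.

03. Controlling the embedding's intrinsic dimension through informed prompt manipulation can bypass RLHF protections.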

Abstract

Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich…
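To make point $(ii)$ of the abstract concrete, the following is a minimal sketch of what a "per-region affine mapping" means for a small MLP. It is not the authors' code and does not reproduce their derivation. It assumes a two-layer network with ReLU activations (many LLMs use other nonlinearities, for which this exact identity may not hold), and the names `relu_mlp` and `local_affine_map` are hypothetical helpers introduced only for illustration. The idea shown: the set of active hidden units at an input defines a region of the partition, and inside that region the network equals a single affine map `A @ x + c`.

```python
import numpy as np


def relu_mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP (an assumed toy model): W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2


def local_affine_map(x, W1, b1, W2, b2):
    """Return (A, c, pattern) such that relu_mlp(z) == A @ z + c
    for every z that shares x's activation pattern."""
    # 1.0 where a hidden unit is active at x, 0.0 otherwise.
    # This binary pattern identifies the region of the partition containing x.
    pattern = (W1 @ x + b1 > 0).astype(W1.dtype)
    # Zeroing the inactive rows of W1 (and entries of b1) removes the ReLU.
    A = W2 @ (pattern[:, None] * W1)
    c = W2 @ (pattern * b1) + b2
    return A, c, pattern


# Random toy weights; the sizes are arbitrary illustration values.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 16, 4
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))
b2 = rng.standard_normal(d_out)

x = rng.standard_normal(d_in)
A, c, pattern = local_affine_map(x, W1, b1, W2, b2)

# The affine map reproduces the network output at x.
print(np.allclose(relu_mlp(x, W1, b1, W2, b2), A @ x + c))
```

Two inputs with the same `pattern` lie in the same region and share the same `(A, c)`; inputs with different patterns generally get different affine maps.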

Code & Models

Repositories

randallbalestriero/splinellm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Linear Layer · Jigsaw