Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov

TL;DR
This paper demonstrates that the genre of a text prompt can be accurately predicted from LLM activations, aiding interpretability and understanding of model behavior.
Contribution
It introduces a novel framework for predicting text genre from LLM activations, showing high accuracy with shallow classifiers across multiple datasets.
Findings
Genre can be predicted with up to 98% F1-score.
Shallow classifiers trained on the activations consistently outperform the control task.
Results are consistent across different datasets.
Abstract
Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
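The probing setup described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than Mistral-7B hidden states, the genre names are hypothetical, a nearest-centroid rule stands in for the scikit-learn classifiers, and the control task is taken to be the same classifier fitted to randomly reassigned labels, which the abstract does not specify.

```python
import random

GENRES = ["news", "fiction", "code"]  # hypothetical genre labels
DIM = 8  # toy activation width; real hidden states are far wider


def fake_activation(genre, rng):
    """Synthetic stand-in for the pooled activation vector of one text chunk."""
    k = GENRES.index(genre)
    # Each genre shifts two coordinates; all coordinates carry unit Gaussian noise.
    return [rng.gauss(4.0 if i // 2 == k else 0.0, 1.0) for i in range(DIM)]


def fit_centroids(X, y):
    """Shallow 'training': average the vectors that share a label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {g: [sum(col) / len(rows) for col in zip(*rows)]
            for g, rows in groups.items()}


def predict(centroids, X):
    """Assign each vector the label of its nearest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda g: dist2(centroids[g], x)) for x in X]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-genre F1 scores."""
    scores = []
    for g in GENRES:
        tp = sum(t == g and p == g for t, p in zip(y_true, y_pred))
        fp = sum(t != g and p == g for t, p in zip(y_true, y_pred))
        fn = sum(t == g and p != g for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(X, y, split):
    """Fit on the first `split` examples and score on the remainder."""
    probe = fit_centroids(X[:split], y[:split])
    return macro_f1(y[split:], predict(probe, X[split:]))


def run_probe(seed=0, n_per_genre=100):
    rng = random.Random(seed)
    data = [(fake_activation(g, rng), g) for g in GENRES for _ in range(n_per_genre)]
    rng.shuffle(data)
    X = [x for x, _ in data]
    y = [g for _, g in data]
    split = int(0.8 * len(X))

    f1_genre = evaluate(X, y, split)

    # Assumed control task: same vectors, labels reassigned at random.
    y_control = y[:]
    rng.shuffle(y_control)
    f1_control = evaluate(X, y_control, split)
    return f1_genre, f1_control


if __name__ == "__main__":
    f1_genre, f1_control = run_probe()
    print(f"genre F1: {f1_genre:.2f}  control F1: {f1_control:.2f}")
```

The gap between the two scores is the quantity of interest: a probe that scores well on real genre labels but near chance on reassigned labels suggests the activations encode genre, rather than the classifier simply memorizing its training set.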
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Text Readability and Simplification
