Crafting Large Language Models for Enhanced Interpretability
Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng

TL;DR
This paper presents CB-LLM, a large language model made inherently interpretable through a concept bottleneck, and introduces Automatic Concept Correction (ACC) to narrow the accuracy gap with black-box models.
Contribution
The paper introduces CB-LLM, a novel interpretable LLM architecture with automatic concept correction, advancing transparency and scalability in language models.
Findings
CB-LLM achieves comparable accuracy to traditional LLMs.
Automatic Concept Correction (ACC) narrows the accuracy gap between CB-LLM and conventional black-box LLMs.
CB-LLM provides built-in, scalable interpretability with clear, accurate explanations, rather than relying on post-hoc interpretation.
Abstract
We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our unique Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability -- a feature markedly absent in existing LLMs.
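The abstract does not spell out the architecture, so the following is only a minimal sketch of the general concept-bottleneck idea, not the authors' implementation. It assumes the model has already mapped a text to one score per human-readable concept, and that a single linear layer turns those scores into class logits. The concept names, class names, and weights are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical concepts and classes (assumptions, not taken from the paper).
CONCEPTS = ["positive wording", "negative wording", "mentions price"]
CLASSES = ["good review", "bad review"]

# Hypothetical final-layer weights, indexed as FINAL_WEIGHTS[class][concept].
FINAL_WEIGHTS = [
    [1.5, -1.0, 0.1],   # weights for "good review"
    [-1.2, 1.4, 0.2],   # weights for "bad review"
]


def predict_with_explanation(
    concept_scores: List[float],
) -> Tuple[str, Dict[str, float]]:
    """Classify from concept scores and report each concept's contribution."""
    if len(concept_scores) != len(CONCEPTS):
        raise ValueError("expected one score per concept")

    # Each class logit is a weighted sum of the concept scores.
    logits = [
        sum(w * s for w, s in zip(row, concept_scores))
        for row in FINAL_WEIGHTS
    ]
    best = max(range(len(CLASSES)), key=lambda i: logits[i])

    # Because the final layer is linear, the winning logit decomposes
    # exactly into per-concept terms, which serve as the explanation.
    contributions = {
        name: FINAL_WEIGHTS[best][j] * concept_scores[j]
        for j, name in enumerate(CONCEPTS)
    }
    return CLASSES[best], contributions


# Example: a text that scores high on "positive wording".
label, why = predict_with_explanation([0.9, 0.1, 0.3])
```

With these made-up numbers, the prediction is "good review" and the largest contribution comes from "positive wording". The point of the design is that every prediction passes through named concepts, so the explanation is read directly off the model rather than estimated afterwards.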
Taxonomy
Topics: Natural Language Processing Techniques
