Uncovering Competency Gaps in Large Language Models and Their Benchmarks
Matyas Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

TL;DR
This paper introduces a novel unsupervised method using sparse autoencoders to identify specific competency gaps in large language models and their benchmarks, revealing weaknesses and imbalances at the concept level.
Contribution
The authors propose a new autoencoder-based approach for uncovering model and benchmark gaps, enabling detailed concept-level evaluation grounded in internal representations.
Findings
Models underperform on safety and boundary-related concepts.
Benchmarks over-represent obedience and instruction-following concepts.
The method successfully recovers known gaps without manual supervision.
Abstract
The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic…
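To make the idea of a concept-level performance score concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so this assumes the score for a concept is the model's accuracy over benchmark examples, weighted by how strongly that concept's SAE feature activates on each example. The function name `concept_scores`, the use of raw activations as the weights, and the toy data are all hypothetical.

```python
import numpy as np

def concept_scores(activations, correct, eps=1e-12):
    """Activation-weighted accuracy and coverage share per concept (illustrative only).

    activations: (n_examples, n_concepts) non-negative SAE feature activations
                 (assumed here to stand in for the paper's saliency weights).
    correct:     (n_examples,) 1 if the model answered the example correctly, else 0.
    """
    activations = np.asarray(activations, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Total activation mass each concept receives across the benchmark.
    weight = activations.sum(axis=0)
    # Activation mass that falls on correctly answered examples.
    weighted_correct = activations.T @ correct

    # Concepts that never activate have no defined score, so report NaN.
    scores = np.where(weight > eps, weighted_correct / np.maximum(weight, eps), np.nan)
    # Share of total activation mass per concept: a rough proxy for benchmark coverage.
    coverage = weight / max(weight.sum(), eps)
    return scores, coverage

# Toy data: 4 benchmark examples, 3 hypothetical concepts.
acts = np.array([[1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [0.0, 3.0, 0.0],
                 [2.0, 0.0, 0.0]])
correct = np.array([1, 0, 0, 1])

scores, coverage = concept_scores(acts, correct)
print(scores)    # per-concept weighted accuracy
print(coverage)  # per-concept share of activation mass
```

Under these assumptions, a low score on a well-covered concept would point to a model gap, while a concept with near-zero coverage would point to a benchmark gap.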
Peer Reviews
Decision · Submitted to ICLR 2026
It is a tool that could prove useful to people in the interpretability field. As stated before, it confirms expected behaviors without any supervision, supporting the soundness of the method. At the same time, nothing particularly new emerges from this paper. The latter is not a strength, but I think it is also not a weakness. Honestly, I find this structured review format annoying; it makes it more difficult to write an organic judgement.
The main problem I have with this work is its writing. It is heavy. The overabundance of references sometimes interrupts the conceptual through-line. Papers should be written to be read, not merely to accompany code, and here the balance leans too far towards the latter. Also, the work remains confined to its enclave. It does not try to speak beyond those already initiated into SAEs, and the writing reflects this inward posture; it should be more self-contained and simply more pleasant to read. A pa
**1. Introducing internal latent features for evaluation framework**
> This paper raises an interesting point that model evaluation does not always need to follow human-defined cognitive categories. Instead, it explores assessing capabilities through the model’s own latent features. Introducing SAEs in this context is a reasonable and promising direction, as it provides a way to diagnose models based on the representations they learn.

**2. Providing a meta-level perspective on benchmark covera
**1. Gap between stated motivation and actual research objective**
> Benchmarks are ultimately meant to communicate and compare core abilities at a broader level for a broad audience, not to enumerate every internal concept or serve primarily as a debugging tool. As a result, while this approach is useful for detailed model analysis, it does not resolve the aggregation problem highlighted in the motivation and is perceived as a diagnostic tool, rather than a practical framework for advancing be
- This paper introduces a use of Sparse Autoencoders (SAEs) to analyze benchmark coverage and model competence at the concept level, moving beyond aggregate accuracy metrics. The framework provides a fresh interpretability-driven lens on how benchmark distributions shape perceived model performance, an idea with clear originality and conceptual significance.
- The use of sparse autoencoders to quantify benchmark and model “concept coverage” is well-formulated and technically sound, providing a co
- Experiments are restricted to two medium-sized instruction-tuned models (Gemma2-2B and Llama3.1-8B). Without results from smaller or larger models, the claimed “systematic competency gaps” may reflect architecture-specific or fine-tuning artifacts rather than generalizable phenomena.
- The proposed method assumes that SAE activations correspond to stable, human-interpretable concepts. However, the paper does not validate this assumption, for example, through multiple SAE runs, layer sensitivity
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
