Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models
Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

TL;DR
This paper introduces an automatic method that inspects model-generated rationales to analyze model performance on specific skills within benchmarks, revealing nuanced trade-offs and enabling targeted improvements.
Contribution
The authors propose a novel approach that recovers skills from rationales, enabling detailed skill-wise evaluation across multiple benchmarks and models and exposing performance differences that aggregate metrics obscure.
Findings
Identified common skills across benchmarks, creating skill-slices for detailed analysis.
Revealed significant performance trade-offs between models on specific skills.
Demonstrated improved accuracy by routing instances based on skill strengths.
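The slice-and-route idea in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the model names (`model_a`, `model_b`), and the helpers `skill_slice_accuracy` and `route` are all hypothetical, and the routing rule (highest mean slice accuracy over an instance's skills) is an assumed simplification.

```python
from collections import defaultdict

# Hypothetical toy data: each instance lists its annotated skills and
# whether each (made-up) model answered it correctly.
instances = [
    {"skills": ["algebra"], "correct": {"model_a": True, "model_b": False}},
    {"skills": ["algebra", "law"], "correct": {"model_a": True, "model_b": True}},
    {"skills": ["law"], "correct": {"model_a": False, "model_b": True}},
    {"skills": ["law"], "correct": {"model_a": False, "model_b": True}},
]

def skill_slice_accuracy(instances):
    """Return {skill: {model: accuracy}} computed over each skill-slice."""
    hits = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for inst in instances:
        for skill in inst["skills"]:
            counts[skill] += 1
            for model, ok in inst["correct"].items():
                hits[skill][model] += int(ok)
    return {s: {m: h / counts[s] for m, h in hits[s].items()} for s in counts}

def route(skills, slice_acc, models):
    """Pick the model with the highest mean slice accuracy over `skills`."""
    def mean_acc(model):
        return sum(slice_acc[s][model] for s in skills) / len(skills)
    return max(models, key=mean_acc)

slice_acc = skill_slice_accuracy(instances)
models = ["model_a", "model_b"]
print(route(["algebra"], slice_acc, models))
print(route(["law"], slice_acc, models))
```

In practice, slice accuracies would presumably be estimated on data held out from the instances being routed, so that the reported gain is not inflated by reusing the same examples.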
Abstract
With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is more accurate…
