Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

Mazda Moayeri; Vidhisha Balachandran; Varun Chandrasekaran; Safoora Yousefi; Thomas Fel; Soheil Feizi; Besmira Nushi; Neel Joshi; Vibhav Vineet

arXiv:2410.13826 · cs.LG · October 25, 2024

TL;DR

This paper introduces an automatic method that recovers the skills tested by each benchmark instance from model-generated rationales, then measures model performance per skill. The resulting skill-level view reveals trade-offs that aggregate accuracy hides and enables targeted improvements.

Contribution

The authors propose a skill-recovery approach based on model-generated rationales. It enables skill-wise evaluation across multiple benchmarks and models, exposing performance differences that aggregate metrics obscure.

Findings

01. Identified common skills across benchmarks, creating skill-slices for detailed analysis.

02. Revealed significant performance trade-offs between models on specific skills.

03. Demonstrated improved accuracy by routing instances based on skill strengths.
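
The skill-slice accuracy and routing ideas in the findings above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the toy records, the skill names, the model names `model_a` and `model_b`, and the routing rule (pick the model with the highest mean slice accuracy over an instance's skills) are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy data: each record lists the skills annotated for one
# instance and whether each (made-up) model answered it correctly.
records = [
    {"skills": ["algebra"], "correct": {"model_a": True, "model_b": False}},
    {"skills": ["algebra", "spatial reasoning"], "correct": {"model_a": True, "model_b": True}},
    {"skills": ["spatial reasoning"], "correct": {"model_a": False, "model_b": True}},
    {"skills": ["spatial reasoning"], "correct": {"model_a": False, "model_b": True}},
]


def slice_accuracy(records):
    """Return {skill: {model: accuracy}}, where each skill-slice is the set
    of instances annotated with that skill."""
    hits = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for record in records:
        for skill in record["skills"]:
            counts[skill] += 1
            for model, ok in record["correct"].items():
                hits[skill][model] += int(ok)
    return {
        skill: {model: h / counts[skill] for model, h in hits[skill].items()}
        for skill in counts
    }


def route(skills, acc, models):
    """Assumed routing rule: choose the model with the highest mean slice
    accuracy over the skills of the incoming instance."""
    def score(model):
        vals = [acc[s][model] for s in skills if s in acc]
        return sum(vals) / len(vals) if vals else 0.0
    return max(models, key=score)


acc = slice_accuracy(records)
models = ["model_a", "model_b"]
print(acc)
print(route(["algebra"], acc, models))
print(route(["spatial reasoning"], acc, models))
```

In this toy setup the two models have the same aggregate accuracy pattern only by construction of the example; the point is that per-slice accuracy separates them, and an instance can be sent to whichever model is stronger on its skills. In practice, slice accuracies would need to be estimated on data held out from the instances being routed.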

Abstract

With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate…
