Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

Dingli Yu; Simran Kaur; Arushi Gupta; Jonah Brown-Cohen; Anirudh Goyal; Sanjeev Arora

arXiv:2310.17567·cs.CL·October 27, 2023·6 cites

TL;DR

This paper introduces Skill-Mix, a novel evaluation method for AI models that measures their ability to flexibly combine learned skills, revealing capabilities beyond traditional benchmarks and suggesting models can synthesize new skill combinations.

Contribution

The work develops a new evaluation framework called Skill-Mix, including methodology and automated grading, to assess AI models' skill combination abilities, which are not captured by existing benchmarks.

Findings

1. GPT-4 performs reasonably on Skill-Mix, indicating advanced skill combination.

2. Significant differences among models are observed beyond leaderboard rankings.

3. Results suggest models can synthesize new skills beyond training data.

Abstract

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^{k}$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and…
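The sampling step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the skill names, the prompt wording, the value N = 100, and the helper names `sample_skill_subset`, `build_prompt`, and `count_skill_subsets` are all hypothetical and chosen only for the example.

```python
import math
import random

# Hypothetical placeholder skills; the paper's actual skill list is not reproduced here.
SKILLS = [
    "metaphor", "modus ponens", "red herring", "self-serving bias",
    "spatial reasoning", "folk physics", "hyperbole", "statistical syllogism",
]


def sample_skill_subset(skills, k, rng):
    """Pick k distinct skills uniformly at random from the list of N skills."""
    return rng.sample(skills, k)


def build_prompt(chosen):
    """Assumed prompt wording: ask for text that combines every chosen skill."""
    return (
        "Write a short piece of text that illustrates all of these skills: "
        + ", ".join(chosen)
        + "."
    )


def count_skill_subsets(n, k):
    """Number of distinct k-subsets of n skills, i.e. the binomial coefficient C(n, k)."""
    return math.comb(n, k)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the illustration is repeatable
    chosen = sample_skill_subset(SKILLS, 3, rng)
    print(build_prompt(chosen))

    # Illustrative N = 100 (an assumption): the count rises steeply with k.
    for k in range(1, 6):
        print(k, count_skill_subsets(100, k))
```

For fixed $k$, the binomial coefficient $\binom{N}{k}$ is approximately $N^{k}/k!$, which is the sense in which the number of subsets "grows like $N^{k}$".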

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

* LLM evaluation (beyond statistical language modelling to general purpose planning systems) is an important and timely topic
* In the empirical evaluation the authors have clearly made an effort to compare a wide range of contemporary LLMs that currently score well on leaderboards
* The key concept behind the skill-mix framework is easy to understand

Weaknesses

* The motivation for this evaluation framework needs further development. Sec. 1 highlights accelerating rates of leaderboard saturation, training set contamination, and training corpora secrecy as pressing issues, and suggests 7 desiderata for new LLM evaluation frameworks: (a) relevant to general-purpose intelligence, (b) easy for humans to design and administer, (c) resistant to training-set contamination, (d) capable of revealing novelty in some sense, (e) easy to grade at scale, (f) easy to

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

I found this paper one of the best papers I reviewed recently (including the NeurIPS reviewing). I found the Introduction (Section 1) and especially "Desiderata for next-generation evaluations" extremely valuable for any LLM development. I am currently working on a task that requires complex evaluation of LLMs and the guidance and observations summarized in the introduction are right to the point of what we are looking for. I also find the idea of mixing random skills with evolving N as a way

Weaknesses

1) I'd love to see some correlation of the presented results with human spot-checking and evaluation and, if such correlation does not exist, discussion on what might be the reason.
2) I'd love to get some suggestion on how this framework can be extended beyond language to Vision-Language Models.
3) I'm curious to see how the experiments with fine-tuning while releasing 10% of skills pan out.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

Motivation: One of the main motivations of the paper is good: LLM evaluations can be trained for so it makes sense to do some combinatorial benchmark which requires combining skills in a way that’s likely unseen in any training corpus.

Method: The idea of skill combination as an LLM evaluation is good, and to the best of my knowledge (I’m not very familiar with LLM benchmarks), novel. It is definitely harder to expect knowl

Weaknesses

Skill picking: Section 3.1 (and the appendix) describe how the authors pick the skills by hand based on textbooks on logical reasoning, rhetoric, and theory of mind. But what is the justification for using these types of skills? For example, I’m sure there are many skills not related to any of those topics which still require some sort of intelligent capability to combine meaningfully. The authors should explain this better.

Videos

SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models · slideslive

Taxonomy

Topics: Scientific Computing and Data Management · Evolutionary Algorithms and Applications · Computability, Logic, AI Algorithms

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization