SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

Dewi S. W. Gould; Bruno Mlodozeniec; Samuel F. Brown

arXiv:2508.06111·cs.AI·February 13, 2026

Open Access · 3 Reviews

TL;DR

SKATE is a scalable, automated evaluation framework where LLMs generate and solve verifiable challenges, enabling objective, open-ended comparison of models' capabilities without human input.

Contribution

The paper introduces SKATE, a novel framework that uses LLMs to evaluate each other through verifiable tasks, reducing reliance on human expertise and enabling scalable, objective assessment.

Findings

01. Weaker models can reliably differentiate stronger ones.

02. LLMs can generate self-preferencing questions.

03. SKATE reveals fine-grained capability differences.

Abstract

Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial…
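As a rough illustration of the setter/solver game described in the abstract, the sketch below plays one round of a code-output-prediction task (the task family the reviews refer to as COP). It is a minimal sketch under stated assumptions, not the authors' implementation: the `Solver` type, `run_snippet`, `play_round`, and the toy model names are all hypothetical, plain Python callables stand in for LLMs, and skipping a model's own question is a simplification made here.

```python
import contextlib
import io
from typing import Callable, Dict

# Hypothetical stand-in for an LLM solver: maps a code snippet to its predicted stdout.
Solver = Callable[[str], str]

def run_snippet(code: str) -> str:
    """Execute a snippet and return what it prints; this is the verifiable ground truth."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # Illustration only: model-written code needs a real sandbox.
    return buffer.getvalue().strip()

def play_round(challenges: Dict[str, str], solvers: Dict[str, Solver]) -> Dict[str, int]:
    """Score every solver on each challenge set by a different model."""
    scores = {name: 0 for name in solvers}
    for setter, code in challenges.items():
        truth = run_snippet(code)
        for name, solve in solvers.items():
            if name == setter:
                continue  # assumption of this sketch: no credit for one's own question
            if solve(code).strip() == truth:
                scores[name] += 1
    return scores

# One challenge per (toy) model, keyed by the model that set it.
challenges = {
    "model_a": "print(sum(range(5)))",
    "model_b": "print('ab' * 2)",
    "model_c": "print(len({1, 2, 2, 3}))",
}
# Toy solvers: two "strong" ones that trace the code exactly, one that always answers "0".
solvers = {
    "model_a": run_snippet,
    "model_b": lambda code: "0",
    "model_c": run_snippet,
}
scores = play_round(challenges, solvers)
print(scores)
```

Because the ground truth comes from executing the snippet, no judge model is needed to score an answer, which is the objectivity property the abstract emphasises.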

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The idea is novel and well motivated. The paper identifies two major bottlenecks in LLM evaluation: (1) reliance on costly, non-scalable human-annotated ground truths, or (2) reliance on LLM-as-judge, which is easily manipulated. SKATE attempts to address both through a peer-challenge multi-model system.
2. The methodology is sound and well considered, for example a robust scoring algorithm to address multiple-choice biases and question clustering to increase question diversity.

Weaknesses

1. While the authors claim that COP tasks provide "a general substrate for evaluating model capabilities," the paper provides zero empirical evidence that SKATE can work with other task types. The COP tasks tested here do not even resemble a typical coding evaluation, in which a model's generation or algorithmic problem-solving capabilities are tested. Any model with code execution tools could solve COP tasks. Without demonstrating SKATE's ability to generalise to more diverse verifiable tasks (such as

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The evaluations are verifiable, unlike common LLM-as-judge frameworks such as Alpaca-Eval, where the biases of a judge LLM influence the evaluation.
- No human input is required to generate the evaluation datasets.
- The resulting rankings (on code output prediction) are shown to be stable under the addition of more LLMs.
- The methodology is shown to elicit some fine-grained performance differences between LLMs, and incorporate varying levels of prior knowledge to assist the LLMs in

Weaknesses

- The methodology is limited to automatically verifiable tasks, while many tasks on which we would like to understand LLM performance (e.g. alignment) cannot be automatically verified.
- The methodology is similarly limited to tasks that can be posed as multiple-choice questions, so it cannot be used to, e.g., compare the summarization abilities of LLMs.
- Although the methodology encourages diversity, it does not ensure coverage, so e.g. in the specific case studied in the paper, it could be the c

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1) The peer-generated, verifiable tournament evaluation framework is an interesting approach towards a scalable and (hopefully) reliable evaluation framework.
2) The work carefully controls MCQ option/order noise with re-sampling and convergence criteria, adds guardrails that reduce reward hacking, and attempts to enforce question uniqueness via embedding-based clustering.

Weaknesses

1) The results are limited to COP. The framework would be stronger with at least one additional verifiable task family.
2) It seems that only one embedding model is used for uniqueness filtering (my apologies if I am mistaken). What happens if a different model is used? How robust is uniqueness filtering to the choice of embedding model?
3) Rankings may depend on the TrueSkill mapping (e.g. relative vs. absolute), the p-threshold, the number of distractors, the number of rounds, etc.
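The option-order re-sampling that Reviewer 03 credits can be pictured with a small example. This is only a sketch of the general idea (average a solver's correctness over many random orderings of the answer options), not the paper's scoring algorithm or its convergence criterion; the `Chooser` interface, `resampled_accuracy`, the fixed trial count, and the toy solvers are all assumptions made for illustration.

```python
import random
from typing import Callable, List

# Hypothetical solver interface: given the options as shown, return the chosen index.
Chooser = Callable[[List[str]], int]

def resampled_accuracy(options: List[str], correct: str, choose: Chooser,
                       trials: int = 200, seed: int = 0) -> float:
    """Estimate accuracy on one question, averaged over random option orders."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shuffled = options[:]
        rng.shuffle(shuffled)  # present the same options in a fresh order
        picked = shuffled[choose(shuffled)]
        hits += picked == correct
    return hits / trials

options = ["10", "4", "5", "15"]

def always_first(opts: List[str]) -> int:
    return 0  # pure position bias: always picks whatever is listed first

def knows_answer(opts: List[str]) -> int:
    return opts.index("10")  # picks the right content wherever it appears

biased = resampled_accuracy(options, "10", always_first)
robust = resampled_accuracy(options, "10", knows_answer)
print(f"position-biased solver: {biased:.2f}, content-based solver: {robust:.2f}")
```

Under a single fixed ordering the position-biased solver would score either 0 or 1 depending on where the correct option happened to sit; averaging over shuffles pulls it towards chance (about 1 in 4 here), while the content-based solver is unaffected.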

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Network Security and Intrusion Detection