SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown

TL;DR
SKATE is a scalable, automated evaluation framework where LLMs generate and solve verifiable challenges, enabling objective, open-ended comparison of models' capabilities without human input.
Contribution
The paper introduces SKATE, a novel framework that uses LLMs to evaluate each other through verifiable tasks, reducing reliance on human expertise and enabling scalable, objective assessment.
Findings
Weaker models can reliably differentiate stronger ones.
LLMs can generate self-preferencing questions.
SKATE reveals fine-grained capability differences.
Abstract
Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial…
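To make the setter/solver game concrete, here is a minimal sketch of one verifiable challenge in the code-output-prediction setting the reviews refer to. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`run_program`, `build_mcq`, `score`), the sample challenge, and the distractors are all hypothetical. The point it shows is that the ground truth comes from executing the setter's code, so no LLM judge is needed to score the solver.

```python
import contextlib
import io
import random


def run_program(source: str) -> str:
    """Execute a Python snippet and return what it prints (the ground truth)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Illustration only: a real system would sandbox setter-written code.
        exec(source, {})
    return buffer.getvalue().strip()


def build_mcq(source: str, distractors: list[str], rng: random.Random) -> tuple[list[str], int]:
    """Return shuffled answer options and the index of the verified answer."""
    truth = run_program(source)
    # Drop any distractor that happens to equal the true output.
    options = [truth] + [d for d in distractors if d != truth]
    rng.shuffle(options)
    return options, options.index(truth)


def score(chosen_index: int, correct_index: int) -> int:
    """Objective scoring: 1 if the solver picked the verified output, else 0."""
    return int(chosen_index == correct_index)


# Hypothetical challenge written by a "setter" model, with hypothetical distractors.
challenge = "xs = [3, 1, 2]\nprint(sorted(xs)[-1] * 2)"
options, correct = build_mcq(challenge, ["4", "2", "5"], random.Random(0))
```

A solver model would be shown `challenge` and `options` and asked for an index; `score` then compares that index with `correct`.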
Peer Reviews
Decision·Submitted to ICLR 2026
1. The whole idea is novel and well-motivated. The paper identifies two major bottlenecks in LLM evaluation: evaluation either (1) requires costly, non-scalable human-annotated ground truths, or (2) relies on LLM-as-judge, which is easily manipulated. SKATE attempts to address both through a peer-challenge multi-model system. 2. The methodology is sound and well considered, e.g. a robust scoring algorithm to address multiple-choice biases and question clustering to increase the diversity of questions.
1. While the authors claim that COP tasks provide "a general substrate for evaluating model capabilities," the paper provides no empirical evidence that SKATE can work with other task types. The COP tasks tested here do not even resemble a typical coding evaluation, in which a model's generation or algorithmic problem-solving capabilities are tested. Any model with code execution tools could solve COP tasks. Without demonstrating SKATE's generalisation ability on more diverse verifiable tasks (such as
- the evaluations are verifiable, differing from common LLM-as-Judge frameworks like Alpaca-Eval, where the biases of a judge LLM influence the evaluation
- no human input is required in generating the evaluation data sets
- the resulting rankings (on code output prediction) are demonstrated to be stable to the addition of more LLMs
- the methodology is shown to elicit some fine-grained performance differences between LLMs, and incorporate varying levels of prior knowledge to assist the LLMs in
- the methodology is limited to automatically verifiable tasks, while many tasks on which we would like to understand LLM performance (e.g. alignment) cannot be automatically verified
- the methodology is similarly limited to tasks that can be posed as multiple-choice questions, so it cannot be used to e.g. compare the summarization abilities of LLMs
- although the methodology encourages diversity, it does not ensure coverage, so e.g. in the specific case studied in the paper, it could be the c
1) The peer-generated, verifiable tournament evaluation framework is an interesting approach towards a scalable and (hopefully) reliable evaluation framework. 2) The work carefully controls MCQ option/order noise with re-sampling and convergence criteria, adds guardrails that reduce reward hacking, and attempts to enforce question uniqueness via embedding-based clustering.
1) The results are limited to COP. The framework would be stronger with at least one additional verifiable task family. 2) It seems that only one embedding model is used for uniqueness filtering (my apologies if I am mistaken). What happens if a different model is used? How robust is uniqueness filtering to the choice of embedding model? 3) Rankings may depend on the TrueSkill mapping (e.g. relative vs. absolute), the p-threshold, the number of distractors, the number of rounds, etc.
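The embedding-based uniqueness filtering that the reviews discuss can be sketched roughly as follows. This is a hedged illustration, not the paper's method: `embed` is a toy bag-of-words stand-in for a real embedding model, the greedy filter replaces whatever clustering the authors use, and the `threshold` of 0.8 is an arbitrary assumed value. Because `embed` is a separate function, swapping in a different model (the robustness question raised above) would only change that one function.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_unique(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep a question only if it is not too similar to any kept one."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


# Hypothetical setter outputs: the second is a near-duplicate of the first.
questions = [
    "print(sum([1, 2, 3]))",
    "print(sum([1, 2, 3]))  ",
    "print(len('hello'))",
]
unique = filter_unique(questions)
```

In this toy run the near-duplicate is dropped and the two distinct questions are kept.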
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Network Security and Intrusion Detection
