League: Leaderboard Generation on Demand
Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang

TL;DR
This paper presents LAG, a systematic framework that automates the creation of research leaderboards in AI by extracting and integrating experimental results from papers, addressing the challenge of tracking rapid research developments.
Contribution
Introduces a comprehensive, automated approach for generating research leaderboards using LLMs, including collection, extraction, integration, and evaluation methods.
Findings
High-quality leaderboards generated automatically
Effective extraction and integration of experimental results
Reliable evaluation method demonstrated
Abstract
This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). With a large number of AI papers updated daily, it is difficult for researchers to track every paper's proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and fair comparison of experiments remain underexplored. LAG addresses these challenges through a systematic approach that involves paper collection, experimental result extraction and integration, leaderboard generation, and quality evaluation. Our contributions include a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* Clear, modular pipeline with practical scope. The four stages and their iteration are well specified, including the choice to operate on LaTeX sections/tables and to retain only “main results”. This design choice addresses token cost and noise pragmatically.
* Settings-aware comparison. Extracting experiment settings (e.g., model size, data size) alongside metrics to support fairer comparisons goes beyond prior T-D-M extraction and is valuable if robust.
* Comprehensive evaluation framing. T
W1. The evaluation framework relies heavily on LLM-as-judge with limited verified ground truth. Topic-related precision and recall appear to depend on internally defined labels of “relevant” items, yet the paper does not fully specify the gold-standard construction, inter-annotator agreement, or adjudication of borderline topic matches. Content quality is primarily judged by LLMs, with only a correlation study to human preferences, introducing risks of circularity and sensitivity to prompt or mo
The primary strength of this paper is the importance of the task to the practice of AI. A high quality solution to this problem would have a significant impact on the field. I appreciate that the authors performed some analysis on different versions of their system, and performed some "meta evaluation" measurement on correlation of their automatic metrics with human judgements.
- The "internal validity" of the benchmark is questionable due to the very limited meta evaluation. The authors only performed a limited pairwise annotation task with humans and looked at correlation with automatic metrics. The automatic metrics are not well established, and I would have expected a much more significant study to validate them or to establish a proper ground truth. This is especially important for this work, as I view the primary contribution here as the new task formulation.
- The
- The paper presents a clear and well-motivated problem: the difficulty of maintaining up-to-date leaderboards amid the rapid growth of research papers.
- The proposed framework is systematic and robust, providing an end-to-end solution covering paper collection, result extraction, and leaderboard generation.
- It achieves high efficiency, significantly reducing the time and effort compared to manual curation.
- The framework is well engineered but mainly extends existing LLM-based pipelines without introducing fundamentally new algorithms or modeling techniques.
- While structure and coverage quality improve across iterations, the latest score remains largely unchanged. The paper needs deeper analysis of how iterative refinement improves leaderboard quality, especially during the refinement stage.
- The paper claims to ensure fair comparisons by aligning experiment settings but does not specify quant
- Reasonable and well-designed pipeline: League presents a clear and structured approach to leaderboard construction. The four-stage design is intuitive and logically organized.
- Ablations and stage-wise validation: Intermediate evaluations and ablation studies effectively demonstrate the contribution and effectiveness of each stage and strategy within League.
- Lack of baselines: The paper only compares League against human-written leaderboards. It does not include comparisons with prior automated or semi-automated methods listed in Table 1, which weakens the empirical evaluation.
- Questionable evaluation: The size of the evaluation set is not clearly described (Table 2). More details are needed on how the evaluation set is collected and categorized. Additionally, the use of LLM-as-a-judge for these metrics raises concerns (see questions below).
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies
