Exploring the Latest LLMs for Leaderboard Extraction
Salomon Kabongo, Jennifer D'Souza, and Sören Auer

TL;DR
This study evaluates the effectiveness of various Large Language Models in extracting structured leaderboard data from AI research papers, comparing different input contexts to identify best practices for automation.
Contribution
It systematically assesses multiple LLMs and input formats for leaderboard extraction, providing new insights into their relative performance and limitations.
Findings
GPT-4-Turbo performs best among tested models.
Context type DocREC yields higher accuracy than others.
Model performance varies significantly depending on input format.
Abstract
The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.
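To make the setup in the abstract concrete, the following is a minimal Python sketch of the (Task, Dataset, Metric, Score) quadruple and of how the three context types might be assembled from a paper that has already been split into sections. It is an illustration only: the names LeaderboardEntry, CONTEXT_SECTIONS, and build_context, as well as the section keys, are hypothetical and are not taken from the paper, whose actual preprocessing is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LeaderboardEntry:
    """One (Task, Dataset, Metric, Score) quadruple, as named in the abstract."""
    task: str
    dataset: str
    metric: str
    score: str  # kept as text, since papers report scores in varied formats


# Hypothetical section keys for each selective context type (an assumption;
# the mapping only mirrors the expansions of the acronyms in the abstract).
CONTEXT_SECTIONS: Dict[str, List[str]] = {
    "DocTAET": ["title", "abstract", "experimental_setup", "tables"],
    "DocREC": ["results", "experiments", "conclusions"],
}


def build_context(sections: Dict[str, str], context_type: str) -> str:
    """Join the sections that belong to the requested context type.

    DocFULL uses every section in the order given; the other two types
    use only the sections listed in CONTEXT_SECTIONS that are present.
    """
    if context_type == "DocFULL":
        keys = list(sections)
    else:
        keys = CONTEXT_SECTIONS[context_type]
    return "\n\n".join(sections[k] for k in keys if k in sections)
```

In such a setup, the string returned by build_context would be placed in the model prompt, and the model's answer would be parsed into a list of LeaderboardEntry objects for scoring against gold annotations.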
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer
