JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

TL;DR
This paper introduces JudgeBoard, an evaluation framework that tests how well small language models can directly judge the correctness of reasoning answers, and proposes a multi-agent judging system (MAJ) that raises their judging performance to match or surpass that of larger models.
Contribution
The work presents JudgeBoard for direct reasoning evaluation and introduces MAJ, a multi-agent framework that enhances small model judgment accuracy.
Findings
MAJ significantly improves small model judgment reliability.
Small models with MAJ can outperform larger models in reasoning judgment.
JudgeBoard enables scalable, fine-grained evaluation of reasoning outputs.
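The idea of combining several small judges can be illustrated with a minimal sketch. The summary here does not say how MAJ combines its agents, so the majority vote below is an assumption swapped in for illustration, not the paper's method. The names `aggregate_verdicts`, `multi_judge`, and the toy judges are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge maps (question, candidate_answer) to a verdict: True means "correct".
Judge = Callable[[str, str], bool]


def aggregate_verdicts(verdicts: Sequence[bool]) -> bool:
    """Majority vote (an assumed aggregation rule); ties resolve to False."""
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def multi_judge(question: str, answer: str, judges: Sequence[Judge]) -> bool:
    """Ask every judge independently, then combine their verdicts."""
    return aggregate_verdicts([judge(question, answer) for judge in judges])


# Toy stand-ins for small-model judges (hypothetical; no real model is called).
def lenient(question: str, answer: str) -> bool:
    return True


def strict(question: str, answer: str) -> bool:
    return answer.strip() == "4"


def length_check(question: str, answer: str) -> bool:
    return len(answer) < 10


verdict = multi_judge("What is 2 + 2?", "4", [lenient, strict, length_check])
```

Resolving ties to False is a conservative design choice for this sketch: an answer is accepted only when a strict majority of judges accept it.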
Abstract
While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both…
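The direct-judging setup described in the abstract can be sketched as follows, assuming each item pairs a question with one candidate answer and a gold correctness label that is used only to score the judge, never shown to it. The prompt wording, the reply format, and the names `JudgingItem`, `build_prompt`, `parse_verdict`, and `judge_accuracy` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class JudgingItem:
    question: str
    candidate_answer: str
    is_correct: bool  # gold label; used only to score the judge


def build_prompt(item: JudgingItem) -> str:
    """Ask for a direct verdict; no reference answer is included."""
    return (
        f"Question: {item.question}\n"
        f"Candidate answer: {item.candidate_answer}\n"
        "Is the candidate answer correct? Reply 'correct' or 'incorrect'."
    )


def parse_verdict(reply: str) -> bool:
    """Assumed reply format: the reply begins with 'correct' or 'incorrect'."""
    return reply.strip().lower().startswith("correct")


def judge_accuracy(
    items: Iterable[JudgingItem], ask_model: Callable[[str], str]
) -> float:
    """Fraction of items on which the model's verdict matches the gold label."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item to score")
    hits = sum(
        parse_verdict(ask_model(build_prompt(item))) == item.is_correct
        for item in items
    )
    return hits / len(items)
```

Here `ask_model` is any function that takes a prompt string and returns the model's text reply, so the same scoring loop can be reused across small and large models.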
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
