Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

Davide Romano; Jonathan Schwarz; Daniele Giofré

arXiv:2510.25623 · cs.CL · October 31, 2025

TL;DR

This paper empirically investigates how verifier-based test-time scaling improves large language models' performance in legal reasoning tasks, analyzing factors like domain specificity, model size, and supervision type.

Contribution

It provides the first systematic evaluation of verifier-based TTS methods in legal QA, comparing outcome and process-level verification across multiple benchmarks and reward models.

Findings

01. Verifier utility varies with domain specialization and model size.

02. Process-level verification can enhance accuracy under low-N budgets.

03. Supervision type influences verifier effectiveness.

Abstract

Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
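
To make the outcome-level setting concrete, the following is a minimal sketch of Best-of-$N$ selection for multiple-choice QA. It is not the authors' implementation: the function names, the `(reasoning, answer)` candidate format, the toy scores, and the score-weighted voting variant are all assumptions made for illustration, and `toy_score` stands in for a real reward model.

```python
from collections import defaultdict
from typing import Callable, Sequence

Candidate = tuple[str, str]  # (reasoning text, answer letter); hypothetical format


def best_of_n(candidates: Sequence[Candidate],
              score: Callable[[str], float]) -> str:
    """Return the answer letter of the single highest-scoring candidate."""
    _, answer = max(candidates, key=lambda c: score(c[0]))
    return answer


def weighted_vote(candidates: Sequence[Candidate],
                  score: Callable[[str], float]) -> str:
    """Sum verifier scores per answer letter and return the top letter.

    A common variant of Best-of-N, included here as an assumption; the
    source does not say which aggregation rule the paper uses.
    """
    totals: dict[str, float] = defaultdict(float)
    for reasoning, answer in candidates:
        totals[answer] += score(reasoning)
    return max(totals, key=totals.get)


# Toy data: N = 3 sampled responses with made-up verifier scores.
samples = [
    ("The clause is void because ...", "B"),
    ("The contract is enforceable since ...", "A"),
    ("Consideration is missing, so ...", "B"),
]
toy_scores = {samples[0][0]: 0.4, samples[1][0]: 0.7, samples[2][0]: 0.5}
toy_score = toy_scores.get  # stand-in for an outcome reward model

print(best_of_n(samples, toy_score))      # A
print(weighted_vote(samples, toy_score))  # B
```

The two rules disagree on this toy input: plain Best-of-$N$ trusts the single top-scored response, while the weighted vote favors the answer that accumulates the most verifier mass across samples. Under small $N$, that choice of aggregation can matter as much as the verifier itself.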
