TL;DR
This paper empirically investigates how verifier-based test-time scaling improves large language models' performance in legal reasoning tasks, analyzing factors like domain specificity, model size, and supervision type.
Contribution
It provides the first systematic evaluation of verifier-based TTS methods in legal QA, comparing outcome and process-level verification across multiple benchmarks and reward models.
Findings
Verifier utility varies with domain specialization and model size.
Process-level verification can enhance accuracy under low-N budgets.
Supervision type influences verifier effectiveness.
Abstract
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
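For readers unfamiliar with outcome-level verification, the Best-of-N idea mentioned in the abstract can be sketched in a few lines: sample N candidate responses, score each with a verifier, and keep the highest-scoring one. This is a minimal illustration only, not the paper's implementation; the function name `best_of_n` and the toy score table are hypothetical stand-ins for a real generator and reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    `score` stands in for an outcome reward model (hypothetical here).
    On ties, the earliest candidate wins, because max() keeps the first maximum.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Toy scores standing in for reward-model outputs (made up for illustration).
toy_scores = {"A": 0.2, "B": 0.9, "C": 0.5}
selected = best_of_n(["A", "B", "C"], lambda c: toy_scores[c])
```

The budget N is simply the number of candidates passed in, so a low-N setting corresponds to a short candidate list and a small number of verifier calls.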
