Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier

TL;DR
This paper introduces LEGIT, a large-scale legal reasoning dataset with hierarchical issue trees, to evaluate and improve LLMs' legal reasoning through rubrics, retrieval augmentation, and reinforcement learning.
Contribution
The paper presents a novel dataset and rubric-based evaluation method for legal reasoning traces, demonstrating how RAG and RL enhance LLM legal reasoning capabilities.
Findings
LLMs' legal reasoning ability is strongly affected by both legal issue coverage and correctness.
Retrieval-augmented generation improves overall reasoning ability.
Reinforcement learning enhances correctness but reduces issue coverage.
Abstract
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented…
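To make the rubric idea concrete, here is a minimal sketch of how a hierarchical issue tree could be scored for issue coverage and correctness. It is an illustration under stated assumptions, not the authors' implementation: the names `IssueNode`, `leaf_issues`, and `score_trace`, the choice to score only leaf issues, and the per-issue `(covered, correct)` judgments (which in practice would come from a human or LLM judge) are all hypothetical and do not appear in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class IssueNode:
    """One legal issue in the tree (hypothetical schema, not from the paper)."""
    issue: str                     # short description of the disputed issue
    conclusion: str                # the court's conclusion on this issue
    children: list["IssueNode"] = field(default_factory=list)


def leaf_issues(node: IssueNode) -> list[IssueNode]:
    """Collect leaf nodes depth-first; this sketch scores leaves only (an assumption)."""
    if not node.children:
        return [node]
    leaves: list[IssueNode] = []
    for child in node.children:
        leaves.extend(leaf_issues(child))
    return leaves


def score_trace(root: IssueNode, judgments: dict[str, tuple[bool, bool]]) -> tuple[float, float]:
    """Return (coverage, correctness) for one reasoning trace.

    judgments maps an issue description to (covered, correct), assumed to be
    supplied by an external judge. Issues missing from the dict count as not covered.
    Coverage is the fraction of leaf issues addressed; correctness is the fraction
    of addressed issues whose conclusion matches the court's.
    """
    leaves = leaf_issues(root)
    covered = [n for n in leaves if judgments.get(n.issue, (False, False))[0]]
    correct = [n for n in covered if judgments[n.issue][1]]
    coverage = len(covered) / len(leaves)
    correctness = len(correct) / len(covered) if covered else 0.0
    return coverage, correctness


# Toy example with made-up issues: three leaf issues under one root.
root = IssueNode("breach of contract", "claim upheld", [
    IssueNode("was a valid contract formed", "yes"),
    IssueNode("damages", "partially awarded", [
        IssueNode("was the loss foreseeable", "yes"),
        IssueNode("did the plaintiff mitigate", "no"),
    ]),
])
judgments = {
    "was a valid contract formed": (True, True),   # addressed, matches the court
    "was the loss foreseeable": (True, False),     # addressed, wrong conclusion
    # "did the plaintiff mitigate" is never addressed by the trace
}
coverage, correctness = score_trace(root, judgments)
```

Keeping the two scores separate, rather than collapsing them into one number, makes it possible to see a trace that is accurate on the few issues it raises but misses others.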
