TL;DR
TriBench-Ko is a Korean benchmark for evaluating the performance and deployment risks of large language models on day-to-day judicial tasks, rather than on proxy tasks such as bar examinations or classification. Evaluated models show notable weaknesses, particularly in precedent retrieval.
Contribution
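The paper introduces a publicly released benchmark for assessing LLM deployment risks in judicial workflows. It covers four core tasks (jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis) and assesses model behavior across four risk categories: inaccuracy, bias, inconsistency, and adjudicative overreach.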
Findings
Many LLMs struggle with precedent retrieval.
Models often exhibit significant risks such as hallucination and bias.
Outputs frequently fail to capture critical legal information.
Abstract
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to…
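The tasks and risk categories listed in the abstract can be written down as a small data structure. The following is a minimal sketch only: the names `Task`, `RISK_TAXONOMY`, and `risk_labels` are hypothetical and are not taken from the TriBench-Ko release, and the layout is an assumption about how one might organize the labels, not the benchmark's actual schema.

```python
from enum import Enum


class Task(Enum):
    """The four core tasks named in the abstract (identifiers are illustrative)."""
    SUMMARIZATION = "jurisprudence summarization"
    PRECEDENT_RETRIEVAL = "precedent retrieval"
    ISSUE_EXTRACTION = "legal issue extraction"
    EVIDENCE_ANALYSIS = "evidence analysis"


# Risk categories and subtypes as listed in the abstract.
# Assumption: adjudicative overreach has no subtypes, since none are listed.
RISK_TAXONOMY = {
    "inaccuracy": ("hallucination", "omission", "statutory misapplication"),
    "bias": ("demographic", "overcompliance"),
    "inconsistency": ("prompt sensitivity", "non-determinism"),
    "adjudicative overreach": (),
}


def risk_labels():
    """Flatten the taxonomy into (category, subtype) pairs.

    A category without subtypes yields a single (category, None) pair.
    """
    pairs = []
    for category, subtypes in RISK_TAXONOMY.items():
        if subtypes:
            pairs.extend((category, subtype) for subtype in subtypes)
        else:
            pairs.append((category, None))
    return pairs
```

Flattening the taxonomy this way gives one label per risk subtype, which is convenient if each model output is annotated against every risk independently; whether the benchmark itself does this is not stated in the abstract.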
