From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Van-Truong Le

arXiv:2604.16270·cs.CL·April 20, 2026

TL;DR

This paper presents a dual-aspect evaluation framework for large language models on Vietnamese legal texts: a quantitative benchmark of Accuracy, Readability, and Consistency, paired with a qualitative error analysis that explains the scores.

Contribution

It introduces a multifaceted evaluation approach, built on an expert-validated error typology, that exposes the trade-offs and error types of LLMs applied to complex Vietnamese legal texts.

Findings

01 Grok-1 excels in Readability and Consistency but lacks fine-grained accuracy.

02 Claude 3 Opus achieves high accuracy but makes critical reasoning errors.

03 Error analysis identifies Incorrect Examples and Misinterpretation as the main failure modes.

Abstract

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and…
