ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts
Nhung Thi-Hong Duong, Mai Ngoc Ho, Tin Van Huynh, and Kiet Van Nguyen

TL;DR
ViLegalNLI is a Vietnamese legal NLI dataset of 42,012 premise-hypothesis pairs with binary inference labels, built with a semi-automatic, LLM-assisted framework to support research on legal reasoning.
Contribution
The paper introduces the first Vietnamese legal NLI dataset, constructed with a novel semi-automatic approach integrating large language models for data generation and validation.
Findings
Few-shot LLMs outperform other models on the dataset.
Performance varies with hypothesis length and reasoning complexity.
Cross-domain evaluation highlights generalization challenges.
Abstract
In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory documents and annotated with binary inference labels (Entailment and Non-entailment). It covers multiple legal domains and reflects realistic legal reasoning scenarios characterized by structured logic, conditional clauses, and domain-specific terminology. To construct ViLegalNLI, we propose a semi-automatic data generation framework that integrates large language models for controlled hypothesis generation and systematic quality validation procedures. The framework incorporates artifact mitigation strategies and cross-model validation to improve annotation reliability and ensure legal consistency. The resulting dataset captures diverse…
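As a rough illustration of the data format the abstract describes, the sketch below models one premise-hypothesis pair with a binary label and scores predictions by accuracy. The class name `NLIPair`, the field names, the helper functions, and the two English example pairs are all hypothetical and invented for illustration; they are not taken from the dataset or from the authors' code, and the actual release may use a different schema.

```python
from collections import Counter
from dataclasses import dataclass

# The two inference labels named in the abstract.
LABELS = ("Entailment", "Non-entailment")


@dataclass(frozen=True)
class NLIPair:
    """One premise-hypothesis pair (hypothetical schema, not the official one)."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        # Reject anything outside the binary label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def label_counts(pairs):
    """Count how many pairs carry each label."""
    return Counter(pair.label for pair in pairs)


def accuracy(pairs, predictions):
    """Fraction of pairs whose predicted label matches the gold label."""
    if not pairs or len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must be non-empty and equal in length")
    correct = sum(pair.label == pred for pair, pred in zip(pairs, predictions))
    return correct / len(pairs)


# Invented placeholder examples, written in English for readability.
examples = [
    NLIPair(
        premise="A contract signed by a minor requires consent from a legal guardian.",
        hypothesis="A minor may need a guardian's consent to sign a contract.",
        label="Entailment",
    ),
    NLIPair(
        premise="A contract signed by a minor requires consent from a legal guardian.",
        hypothesis="Minors may sign any contract without restriction.",
        label="Non-entailment",
    ),
]

print(label_counts(examples))
print(accuracy(examples, ["Entailment", "Entailment"]))
```

Accuracy is used here only because it is the simplest metric for a two-class task; the source excerpt does not state which metrics the authors report.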
