VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

Tan-Minh Nguyen; Hoang-Trung Nguyen; Trong-Khoi Dao; Xuan-Hieu Phan; Ha-Thanh Nguyen; Thi-Hai-Yen Vuong

arXiv:2507.19995 · cs.CL · July 29, 2025

Open Access

TL;DR

This paper introduces VLQA, a large, high-quality Vietnamese legal dataset, and demonstrates its usefulness for legal question answering and information retrieval tasks using state-of-the-art models.

Contribution

The paper presents the first comprehensive Vietnamese legal dataset, addressing resource scarcity and enabling improved legal NLP applications in Vietnamese.

Findings

01. VLQA improves legal question answering performance in Vietnamese.

02. State-of-the-art models benefit from the VLQA dataset.

03. The dataset enables better legal information retrieval in Vietnamese.

Abstract

The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Law · Text Readability and Simplification