Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining

Van-Hoang Le; Duc-Vu Nguyen; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen

arXiv:2507.14619 · cs.IR · July 22, 2025

TL;DR

This paper introduces a two-stage retrieval and re-ranking framework for Vietnamese legal documents. It uses semi-hard negative mining during training and a new evaluation metric, Exist@m, to improve retrieval accuracy and efficiency.

Contribution

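The paper presents a semi-hard negative mining strategy and the Exist@m metric for evaluating retrieval effectiveness. Together they improve legal document retrieval while keeping the pipeline lightweight: one Bi-Encoder and one Cross-Encoder rather than an ensemble.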
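
This summary names the Exist@m metric but does not define it. One plausible reading, assumed here and not confirmed by the source, is the fraction of queries for which at least one relevant document appears among the top m retrieved candidates. The sketch below implements that assumed definition; the function name `exist_at_m` and the toy document IDs are hypothetical.

```python
from typing import Dict, List, Set


def exist_at_m(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    m: int,
) -> float:
    """Assumed definition: share of queries with >= 1 relevant doc in the top m."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-m candidates is relevant.
        if any(doc_id in gold for doc_id in doc_ids[:m]):
            hits += 1
    return hits / len(ranked)


# Toy data (hypothetical): ranked candidates per query and the gold labels.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}

score_at_1 = exist_at_m(ranked, relevant, 1)
score_at_2 = exist_at_m(ranked, relevant, 2)
score_at_3 = exist_at_m(ranked, relevant, 3)
```

Under this reading, the metric indicates whether the first stage hands the re-ranker at least one correct candidate, which is the precondition for the second stage to succeed.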

Findings

01. Achieved top-three ranking in legal document retrieval at SoICT Hackathon 2024.

02. Significant improvement in re-ranking accuracy using semi-hard negatives.

03. Demonstrated competitive performance with fewer parameters than ensemble models.

Abstract

Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative…
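
The abstract does not say how semi-hard negatives are selected. As a minimal sketch, the code below borrows the common triplet-loss convention: a negative is semi-hard when its similarity to the query is below the positive's similarity but within a margin of it. That criterion, the `margin` and `k` parameters, the function names, and the toy vectors are all assumptions for illustration, not the authors' procedure.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_semi_hard_negatives(
    query_vec: Sequence[float],
    positive_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    margin: float = 0.3,
    k: int = 2,
) -> List[str]:
    """Return up to k non-relevant doc IDs scoring in (pos_sim - margin, pos_sim)."""
    pos_sim = cosine(query_vec, positive_vec)
    scored = []
    for doc_id, vec in candidates.items():
        sim = cosine(query_vec, vec)
        # Skip negatives scoring at or above the positive (hardest, and the
        # most likely to be unlabeled positives) and those too far below it.
        if pos_sim - margin < sim < pos_sim:
            scored.append((sim, doc_id))
    scored.sort(reverse=True)  # closest to the positive first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (hypothetical). The positive scores 0.8 against the query.
query = [1.0, 0.0]
positive = [4.0, 3.0]
negatives = {
    "hard": [1.0, 0.0],    # similarity 1.0, above the positive
    "semi": [3.0, 4.0],    # similarity 0.6, inside the band
    "easy": [0.0, 1.0],    # similarity 0.0, too far below
}

selected = mine_semi_hard_negatives(query, positive, negatives)
```

In a two-stage setup like the one described, the candidate pool would presumably come from the Bi-Encoder's top results with the labeled relevant documents removed, and the selected negatives would be paired with positives to train the Cross-Encoder.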
