Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining
Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

TL;DR
This paper introduces a two-stage retrieval and re-ranking framework for Vietnamese legal documents. It uses semi-hard negative mining and a new evaluation metric, Exist@m, to improve accuracy and efficiency.
Contribution
The paper presents a novel semi-hard negative mining strategy and the Exist@m metric, which together improve legal document retrieval performance while keeping the approach lightweight.
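This summary does not define Exist@m. One plausible reading, assumed here and not confirmed by the source, is the share of queries for which at least one relevant document exists among the top m retrieved candidates. Under that assumption, a minimal sketch might look like the following; the function name `exist_at_m` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Set


def exist_at_m(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    m: int,
) -> float:
    """Fraction of queries with at least one relevant document in the top m.

    ASSUMPTION: this definition is a guess at what Exist@m measures;
    the summary above does not spell out the formula.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if not ranked:
        return 0.0

    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any gold document appears in the top m.
        if any(doc_id in gold for doc_id in doc_ids[:m]):
            hits += 1
    return hits / len(ranked)


# Toy data: q1's gold document sits at rank 3, q2's is never retrieved.
toy_ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
toy_relevant = {"q1": {"c"}, "q2": {"x"}}
score_at_2 = exist_at_m(toy_ranked, toy_relevant, m=2)
score_at_3 = exist_at_m(toy_ranked, toy_relevant, m=3)
```

Read this way, the metric would indicate whether the retrieval stage hands the re-ranker at least one correct candidate to work with, which is the precondition for re-ranking to succeed at all.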
Findings
The team, 4Huiter, placed in the top three in Legal Document Retrieval at the SoICT Hackathon 2024.
Training with semi-hard negatives significantly improved re-ranking accuracy.
The framework performed competitively while using fewer parameters than ensemble models.
Abstract
Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative…
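The abstract names semi-hard negatives but this summary does not say how they are selected. The sketch below borrows the usual triplet-loss notion of "semi-hard" as an assumption: a negative that scores below the positive but within a margin of it. Negatives that outscore the positive are skipped, since they may be unlabeled relevant documents, and very low-scoring ones are skipped as too easy. The function name, the `margin` and `k` parameters, and the toy scores are hypothetical, and the scores stand in for Bi-Encoder similarities.

```python
from typing import Dict, List, Set


def mine_semi_hard_negatives(
    scores: Dict[str, float],
    positives: Set[str],
    margin: float = 0.1,
    k: int = 4,
) -> List[str]:
    """Pick up to k negatives scoring just below the weakest positive.

    ASSUMPTION: "semi-hard" follows the triplet-loss convention
    (positive_score - margin < negative_score < positive_score);
    the paper's actual selection rule may differ.
    """
    positive_scores = [scores[d] for d in positives if d in scores]
    if not positive_scores:
        return []

    # Use the weakest positive as the reference so that every selected
    # negative scores below all positives.
    reference = min(positive_scores)

    candidates = [
        (score, doc_id)
        for doc_id, score in scores.items()
        if doc_id not in positives and reference - margin < score < reference
    ]
    # Hardest (highest-scoring) semi-hard negatives first.
    candidates.sort(reverse=True)
    return [doc_id for _, doc_id in candidates[:k]]


# Toy retrieval scores for one query; "p" is the labeled positive.
toy_scores = {"p": 0.80, "n1": 0.95, "n2": 0.78, "n3": 0.72, "n4": 0.40}
negatives = mine_semi_hard_negatives(toy_scores, positives={"p"}, margin=0.1)
```

In this toy case `n1` is excluded because it outranks the positive, and `n4` is excluded because it falls outside the margin. The selected negatives could then be paired with the positive as training examples for the Cross-Encoder re-ranker.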
