Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles

Kevin Bönisch; Alexander Mehler

arXiv:2501.05018·cs.IR·January 10, 2025

TL;DR

This paper presents an ensemble-based retrieval method for legal documents that combines Support Vector Regression (SVR), bagging, and embedding spaces, and reports improved recall over the baselines without training or fine-tuning any deep learning models.

Contribution

The paper introduces a bagging and SVR ensemble approach for legal document retrieval on GerDaLIR that exceeds the baselines in recall (0.849 versus 0.803 and 0.829) without training or fine-tuning any deep learning models.

Findings

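01. Recall improves to 0.849 with the voting ensemble, compared with baselines of 0.803 and 0.829.

02. The approach is effective when retrieval is framed as multiple binary needle-in-a-haystack subtasks.

03. No deep learning model is trained or fine-tuned.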

Abstract

We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 > 0.803 | 0.829) using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.
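The abstract does not spell out the pipeline, so the following is only a minimal sketch of one possible reading of it, not the authors' implementation. It assumes each query is treated as a single positive "needle" (label 1) against a bootstrap sample of haystack documents (label 0), fits one regressor per bag, and lets each bag vote for its top-scoring documents. To stay dependency-free it swaps in a tiny linear regressor with epsilon-insensitive loss, trained by subgradient descent, as a stand-in for a library SVR. All names (`fit_linear_svr`, `bagged_vote_retrieve`), the toy random embeddings, and the hyperparameter values are hypothetical.

```python
import random

def fit_linear_svr(X, y, epochs=50, lr=0.05, C=1.0, epsilon=0.1):
    """Linear regressor with epsilon-insensitive loss, fitted by subgradient descent.
    A stand-in for a library SVR; not the authors' model."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            # Subgradient of the loss: zero inside the epsilon tube, +/-1 outside.
            g = 0.0 if abs(err) <= epsilon else (1.0 if err > 0 else -1.0)
            w = [wi - lr * (wi / len(X) + C * g * xi) for wi, xi in zip(w, x)]
            b -= lr * C * g
    return w, b

def predict(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def bagged_vote_retrieve(query, docs, n_bags=8, bag_size=20, top_k=3, seed=0):
    """Each bag fits one regressor (query = 1, sampled documents = 0) and votes
    for its top_k highest-scoring documents; documents are ranked by vote count."""
    rng = random.Random(seed)
    votes = [0] * len(docs)
    for _ in range(n_bags):
        # Bootstrap sample of the haystack, drawn with replacement.
        bag = [docs[rng.randrange(len(docs))] for _ in range(bag_size)]
        model = fit_linear_svr([query] + bag, [1.0] + [0.0] * bag_size)
        scores = [predict(model, d) for d in docs]
        top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
        for i in top:
            votes[i] += 1
    ranked = sorted(range(len(docs)), key=lambda i: votes[i], reverse=True)[:top_k]
    return ranked, votes

# Toy data: 40 random 8-dimensional "embeddings"; the query is a noisy copy of document 7.
data_rng = random.Random(42)
docs = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
query = [v + data_rng.gauss(0, 0.05) for v in docs[7]]

ranked, votes = bagged_vote_retrieve(query, docs)
print("top documents by votes:", ranked)
print("votes for the needle (document 7):", votes[7])
```

In a realistic setting the vectors would come from a pretrained encoder and the regressor from an established SVR implementation. The sketch is meant only to show how bagging and voting can combine many weak binary decisions into a single ranking.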

Code & Models

Repositories

TheItCrOw/LIRAI24 (tf, Official)

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations · Law in Society and Culture