Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes
Fusheng Wei, Robert Neary, Han Qin, Qiang Mao, Jianping Zhang

TL;DR
This study compares four text embedding models to evaluate their effectiveness in classifying delay-related statements in construction dispute documents, aiming to improve legal document review processes.
Contribution
The paper provides a comparative analysis of four embedding models for binary text classification ('delay' vs. 'not delay') in construction delay disputes, highlighting their potential to improve legal document review.
Findings
Embedding models can effectively classify delay-related text snippets.
Logistic Regression outperforms KNN in this classification task.
Embedding-based classification enhances document review efficiency.
Abstract
Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a continuous vector space. The primary goal of text embeddings is to enable the processing of text data by machine learning models, which require numerical input. Numerous embedding models have been developed for various applications. This paper presents our work in evaluating different embeddings through a comprehensive comparative analysis of four distinct models, focusing on their text classification efficacy. We employ both K-Nearest Neighbors (KNN) and Logistic Regression (LR) to perform binary classification tasks, specifically determining whether a text snippet is associated with 'delay' or 'not delay' within a labeled dataset. Our research explores…
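As an illustration of the setup the abstract describes, the following is a minimal sketch of binary 'delay' vs. 'not delay' classification over embedding vectors, using a cosine-similarity KNN vote and a small gradient-descent logistic regression. It is not the authors' implementation: the three-dimensional vectors, the labels, the hyperparameters (`k`, `lr`, `epochs`), and all function names are hypothetical and chosen only to keep the example self-contained. In practice the vectors would come from an embedding model and the classifiers from a machine learning library.

```python
import math

# Hypothetical toy "embeddings" with labels: 1 = 'delay', 0 = 'not delay'.
# Real embeddings would have hundreds of dimensions and come from a model.
TRAIN = [
    ([0.9, 0.1, 0.0], 1),
    ([0.8, 0.2, 0.1], 1),
    ([0.7, 0.3, 0.0], 1),
    ([0.1, 0.9, 0.2], 0),
    ([0.0, 0.8, 0.3], 0),
    ([0.2, 0.7, 0.1], 0),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = sum(label for _, label in ranked[:k])
    return 1 if votes * 2 > k else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(train, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(train[0][0])
    w = [0.0] * dim
    b = 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in train:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def lr_predict(model, query):
    """Predict 1 ('delay') when the estimated probability is at least 0.5."""
    w, b = model
    prob = sigmoid(sum(wi * xi for wi, xi in zip(w, query)) + b)
    return 1 if prob >= 0.5 else 0


model = fit_logistic(TRAIN)
query_delay = [0.85, 0.15, 0.05]  # hypothetical embedding of a delay-related snippet
query_other = [0.10, 0.85, 0.20]  # hypothetical embedding of an unrelated snippet

print("KNN:", knn_predict(TRAIN, query_delay), knn_predict(TRAIN, query_other))
print("LR: ", lr_predict(model, query_delay), lr_predict(model, query_other))
```

KNN needs no training step but compares each query against the stored vectors, while logistic regression learns a single weight vector once and then scores each query with one dot product. The sketch says nothing about which classifier performs better on real data.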
Taxonomy
Topics: Dispute Resolution and Class Actions · Artificial Intelligence in Law · Linguistics and Terminology Studies
Methods: Logistic Regression
