Leveraging Machine Learning and Large Language Models for Automated Image Clustering and Description in Legal Discovery
Qiang Mao, Fusheng Wei, Robert Neary, Charles Wang, Han Qin, Jianping Zhang, Nathaniel Huber-Fliflet

TL;DR
This paper explores automated image clustering and description generation using machine learning and large language models to improve efficiency in legal discovery workflows involving large-scale image datasets.
Contribution
It systematically evaluates sampling, prompting, and description methods, demonstrating effective strategies for scalable, accurate image cluster descriptions with reduced computational costs.
Findings
Sampling 20 images per cluster maintains quality while reducing costs
LLM-based descriptions outperform traditional TF-IDF methods
Standard prompts outperform chain-of-thought prompts in this context
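The sampling and prompting findings can be illustrated with a minimal sketch. It assumes captions are already grouped by cluster ID. The function names, the toy captions, and the prompt wording are hypothetical and are not taken from the paper. Only the cap of 20 samples per cluster and the use of a plain (non chain-of-thought) prompt come from the findings above.

```python
import random


def sample_captions(captions_by_cluster, k=20, seed=0):
    """Return at most k captions per cluster, sampled without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = {}
    for cluster_id, captions in sorted(captions_by_cluster.items()):
        if len(captions) <= k:
            # Small clusters are passed through unchanged.
            sampled[cluster_id] = list(captions)
        else:
            sampled[cluster_id] = rng.sample(captions, k)
    return sampled


def build_prompt(captions):
    """Assemble a plain prompt from captions (hypothetical wording)."""
    lines = "\n".join(f"- {c}" for c in captions)
    return (
        "The following captions describe images from one cluster.\n"
        f"{lines}\n"
        "Write one short description of what these images have in common."
    )


# Toy data standing in for per-image captions (illustrative only).
captions_by_cluster = {
    0: [f"a scanned invoice, page {i}" for i in range(50)],
    1: ["a whiteboard with handwritten notes", "a whiteboard in a meeting room"],
}

sampled = sample_captions(captions_by_cluster, k=20)
prompt = build_prompt(sampled[1])
```

The resulting prompt string would then be sent to an LLM of choice. Capping the number of captions bounds the prompt length, and therefore the cost per cluster, regardless of cluster size.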
Abstract
The rapid increase in digital image creation and retention presents substantial challenges during legal discovery, digital archiving, and content management. Corporations and legal teams must organize, analyze, and extract meaningful insights from large image collections under strict time pressures, making manual review impractical and costly. These demands have intensified interest in automated methods that can efficiently organize and describe large-scale image datasets. This paper presents a systematic investigation of automated cluster description generation through the integration of image clustering, image captioning, and large language models (LLMs). We apply K-means clustering to group images into 20 visually coherent clusters and generate base captions using the Azure AI Vision API. We then evaluate three critical dimensions of the cluster description process: (1) image sampling…
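The clustering step named in the abstract can be sketched in a few lines. This is a minimal pure-Python version of Lloyd's algorithm for K-means, run on toy 2-D points with k=2. The abstract does not say how images are turned into feature vectors, so the points below are placeholders for image embeddings, and the function name and parameters are hypothetical. The paper itself uses 20 clusters.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy vectors standing in for image embeddings (illustrative only).
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
labels, centroids = kmeans(points, k=2)
```

In practice a library implementation with more careful initialisation would be preferable for large collections. The sketch only shows the alternating assignment and update steps.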
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence Applications · Artificial Intelligence in Law
