DistillER: Knowledge Distillation in Entity Resolution with Large Language Models

Alexandros Zeakis; George Papadakis; Dimitrios Skoutas; Manolis Koubarakis

arXiv:2602.05452 · cs.DB · February 6, 2026

Open Access

TL;DR

This paper introduces DistillER, a framework that uses knowledge distillation to transfer the capabilities of large language models to smaller, more efficient models for entity resolution, balancing effectiveness and computational cost.

Contribution

DistillER systematically explores data selection, knowledge elicitation, and distillation algorithms to improve LLM-based entity resolution without requiring gold labels.

Findings

01. Supervised fine-tuning on noisy labels outperforms other knowledge distillation (KD) strategies.

02. DistillER achieves better effectiveness and efficiency than existing methods.

03. High-quality explanations can be generated by distilled models.

Abstract

Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches operate either in unsupervised settings and rely on very large and costly models, or in supervised settings and require ground-truth annotations, leaving a critical gap between time efficiency and effectiveness. To make LLM-powered ER more practical, we investigate Knowledge Distillation (KD) as a means to transfer knowledge from large, effective models (Teachers) to smaller, more efficient models (Students) without requiring gold labels. We introduce DistillER, the first framework that systematically bridges this gap across three dimensions: (i) Data Selection, where we study strategies for identifying informative subsets of data; (ii) Knowledge Elicitation,…
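The teacher-to-student idea described above can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a hypothetical teacher (a thresholded token-overlap score with random label flips, standing in for a large LLM that emits noisy match decisions) and a hypothetical student (a two-weight logistic regression, standing in for a small fine-tuned model). All names, thresholds, and example records below are illustrative assumptions.

```python
import math
import random

# Hypothetical candidate record pairs (illustrative data, not from the paper).
PAIRS = [
    ("apple iphone 13 128gb", "apple iphone 13 128gb black"),
    ("sony wh-1000xm4 headphones", "sony wh-1000xm4 wireless headphones"),
    ("dell xps 13 laptop", "dell xps 13 laptop"),
    ("canon eos r6 camera", "nikon z6 camera body"),
    ("samsung galaxy s21", "google pixel 6"),
    ("hp laserjet printer", "epson inkjet printer"),
]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def teacher_label(pair, rng: random.Random, noise: float = 0.1) -> int:
    """Assumed stand-in for an LLM teacher: returns 1 (match) or 0 (non-match).

    With probability `noise` the label is flipped, mimicking teacher errors.
    No gold labels are consulted anywhere.
    """
    label = 1 if jaccard(*pair) >= 0.5 else 0
    if rng.random() < noise:
        label = 1 - label
    return label


def features(pair):
    """Student input: a bias term plus one similarity feature."""
    return [1.0, jaccard(*pair)]


def train_student(pairs, labels, epochs: int = 200, lr: float = 0.5):
    """Fit a logistic-regression student on the teacher's (noisy) labels via SGD."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for pair, y in zip(pairs, labels):
            x = features(pair)
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the log-likelihood of the teacher label.
            for i in range(len(w)):
                w[i] += lr * (y - p) * x[i]
    return w


def student_predict(w, pair) -> int:
    """Return 1 if the student scores the pair as a match, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, features(pair)))
    return 1 if z >= 0 else 0


if __name__ == "__main__":
    rng = random.Random(0)
    noisy_labels = [teacher_label(p, rng) for p in PAIRS]
    weights = train_student(PAIRS, noisy_labels)
    for pair in PAIRS:
        print(pair, "->", student_predict(weights, pair))
```

In a realistic setting the teacher call would be a prompt to a large model and the student would be a smaller language model fine-tuned on the teacher's outputs. The sketch only shows the data flow: unlabeled pairs go to the teacher, and its noisy decisions become the student's training targets.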

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Machine Learning in Healthcare