Embedding Retrofitting: Data Engineering for better RAG

Anantha Sharma

arXiv:2601.15298·cs.CL·February 18, 2026

Open Access

TL;DR

This paper introduces a data engineering framework that makes embedding retrofitting more effective for retrieval by cleaning annotation artifacts, such as hashtags, that degrade knowledge graph quality in real-world corpora.

Contribution

It demonstrates that preprocessing quality critically impacts retrofitting effectiveness and proposes methods to mitigate noise from annotation artifacts in knowledge graphs.

Findings

01

Preprocessing reduces spurious edges caused by hashtags.

02

Retrofitting on cleaned data improves retrieval accuracy by 6.2% (EWMA retrofitting, p = 0.0348).

03

Quality of data preprocessing outweighs algorithm differences.
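The first finding can be illustrated with a minimal sketch. It assumes a simple document-level co-occurrence graph and a regex-based hashtag filter. Both are illustrative assumptions, since the page does not describe the paper's actual graph construction or preprocessing rules, and the names `HASHTAG`, `strip_hashtags`, and `cooccurrence_edges` are hypothetical.

```python
import re
from itertools import combinations

# Assumed hashtag pattern: '#' followed by word characters, e.g. "#q3".
HASHTAG = re.compile(r"#\w+")


def strip_hashtags(text: str) -> str:
    """Remove hashtag tokens and collapse the leftover whitespace."""
    return " ".join(HASHTAG.sub(" ", text).split())


def cooccurrence_edges(docs):
    """Return undirected edges between tokens that share a document.

    This is a stand-in for knowledge graph construction, not the paper's method.
    """
    edges = set()
    for doc in docs:
        tokens = sorted(set(doc.lower().split()))
        edges.update(combinations(tokens, 2))
    return edges


docs = [
    "revenue grew #q3 #earnings",
    "margin fell #q3 #earnings",
]

noisy = cooccurrence_edges(docs)
clean = cooccurrence_edges([strip_hashtags(d) for d in docs])

print(len(noisy), len(clean))
```

In this toy corpus the two hashtags attach to every content word in their document, so most edges in the unfiltered graph involve a hashtag. Stripping them first leaves only the edges between content words.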

Abstract

Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation ($-3.5\%$ to $-5.2\%$, $p < 0.05$). After preprocessing, EWMA retrofitting achieves a $+6.2\%$ improvement ($p = 0.0348$), with benefits concentrated in quantitative synthesis questions ($+33.8\%$ average). The gap between clean and noisy…
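To make the idea of spurious edges corrupting the retrofitting objective concrete, here is a small sketch of the classic iterative neighbour-averaging retrofitting update (in the style of Faruqui et al., 2015), not the EWMA variant evaluated in the paper, whose exact form is not given on this page. The toy vectors, the edge lists, and the function name `retrofit` are assumptions made for illustration.

```python
def retrofit(vectors, edges, iterations=50, alpha=1.0):
    """Pull each vector toward its graph neighbours while anchoring it
    to its original value (classic retrofitting, uniform edge weights)."""
    neighbours = {w: set() for w in vectors}
    for a, b in edges:
        if a in vectors and b in vectors and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbours.items():
            if not nbrs:
                continue  # isolated words keep their original vector
            beta = 1.0 / len(nbrs)  # neighbour weights sum to 1
            new[w] = [
                (alpha * vectors[w][k] + beta * sum(new[n][k] for n in nbrs))
                / (alpha + 1.0)
                for k in range(len(vectors[w]))
            ]
    return new


# Toy 2-d embeddings; "#q3" stands in for a hashtag annotation artifact.
vectors = {
    "revenue": [1.0, 0.0],
    "sales": [0.0, 1.0],
    "#q3": [-1.0, -1.0],
}

clean_edges = [("revenue", "sales")]
noisy_edges = clean_edges + [("revenue", "#q3"), ("sales", "#q3")]

clean = retrofit(vectors, clean_edges)
noisy = retrofit(vectors, noisy_edges)

print(clean["revenue"], noisy["revenue"])
```

With only the clean edge, "revenue" moves part of the way toward "sales". With the hashtag edges added, the same update also drags it toward the unrelated "#q3" vector, which is the kind of distortion the abstract attributes to inflated graph density.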

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Data Quality and Management