Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma

TL;DR
This paper introduces a data engineering framework for embedding retrofitting that addresses data quality issues caused by annotation artifacts, improving retrieval on real-world corpora.
Contribution
It demonstrates that preprocessing quality critically impacts retrofitting effectiveness and proposes methods to mitigate noise from annotation artifacts in knowledge graphs.
Findings
Preprocessing reduces spurious edges caused by hashtags.
Retrofitting on cleaned data improves retrieval accuracy by over 6%.
Quality of data preprocessing outweighs algorithm differences.
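To make the first finding concrete, here is a minimal sketch of how hashtag tokens can inflate a co-occurrence graph. It assumes hashtags are `#word` tokens and that edges link any two distinct tokens appearing in the same document. Both are illustrative assumptions, and `strip_hashtags`, `cooccurrence_edges`, and the sample documents are hypothetical, not taken from the paper.

```python
import re
from itertools import combinations

# Assumption: a hashtag annotation is a '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def strip_hashtags(text):
    """Remove '#tag' tokens and collapse the leftover whitespace."""
    return " ".join(HASHTAG.sub(" ", text).split())


def cooccurrence_edges(docs):
    """Return undirected edges between distinct tokens that share a document."""
    edges = set()
    for doc in docs:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        tokens = sorted(set(re.findall(r"#?\w+", doc.lower())))
        edges.update(combinations(tokens, 2))
    return edges


# Hypothetical two-document corpus where the same tags are attached to both.
docs = [
    "revenue grew #finance #q3 #report",
    "margin fell #finance #q3 #report",
]

noisy = cooccurrence_edges(docs)
clean = cooccurrence_edges([strip_hashtags(d) for d in docs])
print(len(noisy), len(clean))
```

In this toy corpus the repeated tags connect every content word to every tag and the tags to each other, so most edges in the noisy graph carry no domain meaning. Stripping the tags leaves only the edges between content words.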
Abstract
Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation. After preprocessing, EWMA retrofitting achieves an improvement, with benefits concentrated in quantitative synthesis questions. The gap between clean and noisy…
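For readers unfamiliar with retrofitting, the sketch below shows the classic iterative neighbor-averaging update, in which each vector is pulled toward the mean of its graph neighbors while staying anchored to its original value. This is not the EWMA variant named in the abstract, whose details are not given here. The function name `retrofit`, the `alpha` weight, and the toy vectors are assumptions made for illustration.

```python
def retrofit(vectors, edges, iterations=10, alpha=1.0):
    """Pull each vector toward the mean of its graph neighbors.

    vectors: dict mapping word -> list of floats (originals are not modified)
    edges:   iterable of (word, word) undirected pairs
    alpha:   weight on the original vector; neighbor weights sum to 1
    """
    neighbors = {w: set() for w in vectors}
    for a, b in edges:
        if a in vectors and b in vectors and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbors.items():
            if not nbrs:
                continue  # words with no edges keep their original vector
            dim = len(vectors[w])
            mean = [sum(new[n][k] for n in nbrs) / len(nbrs) for k in range(dim)]
            new[w] = [
                (alpha * vectors[w][k] + mean[k]) / (alpha + 1.0)
                for k in range(dim)
            ]
    return new


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


# Hypothetical embeddings and a single knowledge graph edge.
vectors = {"profit": [1.0, 0.0], "earnings": [0.0, 1.0], "banana": [5.0, 5.0]}
fitted = retrofit(vectors, [("profit", "earnings")])
```

Because every edge moves its two endpoints closer together, a spurious edge (for example, one created by a shared hashtag) would pull unrelated words together in the same way, which is why graph quality matters for this objective.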
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Data Quality and Management
