Domain Specific Data Distillation and Multi-modal Embedding Generation

Sharadind Peddiraju; Srini Rajagopal

arXiv:2410.20325 · cs.LG · October 29, 2024

TL;DR

This paper presents a novel hybrid modeling approach that enhances domain-specific embeddings by filtering noise from unstructured data using structured data, leading to improved attribute prediction in the cloud computing domain.

Contribution

The paper introduces a Hybrid Collaborative Filtering (HCF) framework that fine-tunes generic entity representations through relevant item prediction tasks. The resulting embeddings outperform AutoEncoder-based embeddings built from purely unstructured data.

Findings

01. 28% lift in precision for attribute prediction over AutoEncoder-based embeddings

02. 11% lift in recall for attribute prediction over AutoEncoder-based embeddings

03. Structured data effectively filters noise from unstructured data
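
To make the precision and recall figures concrete, here is a minimal sketch of how such metrics and a lift between two models could be computed. The listing does not say whether the reported lifts are relative or absolute, so the `relative_lift` helper is an assumption, and the attribute sets are hypothetical toy data, not results from the paper.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for one entity's predicted attributes."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


def relative_lift(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction.

    Assumption: the reported lift is relative. An absolute lift would be
    `new - baseline` instead.
    """
    return (new - baseline) / baseline


# Hypothetical toy data (not from the paper).
actual = {"compute", "storage", "networking", "database"}
baseline_pred = {"compute", "storage", "email", "gaming"}
candidate_pred = {"compute", "storage", "networking", "email"}

base_p, base_r = precision_recall(baseline_pred, actual)
cand_p, cand_r = precision_recall(candidate_pred, actual)

precision_lift = relative_lift(cand_p, base_p)
recall_lift = relative_lift(cand_r, base_r)
```

In practice these metrics would be averaged over many entities before the lift is taken. The single-entity case is shown only to keep the arithmetic visible.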

Abstract

The challenge of creating domain-centric embeddings arises from the abundance of unstructured data and the scarcity of domain-specific structured data. Conventional embedding techniques often rely on only one of these modalities, limiting their applicability and efficacy. This paper introduces a novel modeling approach that leverages structured data to filter noise from unstructured data, resulting in embeddings with high precision and recall for domain-specific attribute prediction. The proposed model operates within a Hybrid Collaborative Filtering (HCF) framework, where generic entity representations are fine-tuned through relevant item prediction tasks. Our experiments, focusing on the cloud computing domain, demonstrate that HCF-based embeddings outperform AutoEncoder-based embeddings (using purely unstructured data), achieving a 28% lift in precision and an 11% lift in recall for…
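
The idea of fine-tuning generic entity representations through an item prediction task can be illustrated with a small sketch. This is not the authors' architecture, which the abstract does not detail. It assumes a single linear projection `W` over precomputed text embeddings, learned item embeddings `V`, a binary cross-entropy loss, and synthetic data. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_items, d_text, d_out = 40, 12, 16, 8

# Assumption: generic entity embeddings derived from unstructured text.
# Random stand-ins are used here in place of real text embeddings.
X = rng.normal(size=(n_entities, d_text))

# Assumption: structured entity-item relevance labels. They are synthetic
# and generated from a hidden linear map so that they are learnable.
hidden = rng.normal(size=(d_text, n_items))
Y = (X @ hidden > 0).astype(float)

W = rng.normal(scale=0.1, size=(d_text, d_out))   # fine-tuning projection
V = rng.normal(scale=0.1, size=(n_items, d_out))  # item embeddings


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def bce(P, Y):
    """Mean binary cross-entropy between probabilities P and labels Y."""
    eps = 1e-9
    return float(-np.mean(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps)))


def forward(X, W, V):
    E = X @ W                    # fine-tuned entity embeddings
    return E, sigmoid(E @ V.T)   # predicted entity-item relevance


_, P0 = forward(X, W, V)
initial_loss = bce(P0, Y)

lr = 0.3
for _ in range(1000):
    E, P = forward(X, W, V)
    G = (P - Y) / Y.size         # gradient of mean BCE w.r.t. the logits
    grad_V = G.T @ E
    grad_W = X.T @ (G @ V)
    V -= lr * grad_V
    W -= lr * grad_W

entity_embeddings, P_final = forward(X, W, V)
final_loss = bce(P_final, Y)
```

The supervision from structured labels pushes the projection to keep directions of the text embedding that predict relevant items and to down-weight the rest, which is one simple reading of "filtering noise" from unstructured data.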

Taxonomy

Topics: Text and Document Classification Technologies · Video Analysis and Summarization