Character-level Deep Conflation for Business Data Analytics

Zhe Gan; P. D. Singh; Ameet Joshi; Xiaodong He; Jianshu Chen; Jianfeng Gao; Li Deng

arXiv:1702.02640·cs.CL·February 10, 2017·1 cites

TL;DR

This paper introduces a character-level deep learning model for conflating text attributes in business data, effectively handling misspellings and semantic differences to improve data merging accuracy.

Contribution

It proposes two novel deep conflation models using LSTM and CNN architectures, advancing semantic understanding at the character level for business data analytics.

Findings

01. Both models outperform baseline bag-of-character approaches.

02. LSTM-based model achieves higher accuracy in conflation tasks.

03. Models effectively handle misspellings and semantic variations.

Abstract

Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity. However, the conflation task is challenging because two text strings that describe the same entity could be quite different from each other for reasons such as misspelling. It is therefore critical to develop a conflation model that is able to truly understand the semantic meaning of the strings and match them at the semantic level. To this end, we develop a character-level deep conflation model that encodes the input text strings from character level into finite dimension feature vectors, which are then used to compute the cosine similarity between the text strings. The model is trained in an end-to-end manner using back propagation and stochastic…
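The matching step described in the abstract (encode each string from its characters into a fixed-length vector, then score the pair by cosine similarity) can be illustrated with a minimal sketch. This is not the paper's LSTM or CNN encoder: it swaps in a plain bag-of-characters count vector, similar in spirit to the baseline mentioned in the findings, and the function names `char_bag`, `cosine_similarity`, and `similarity` are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def char_bag(text: str) -> Counter:
    """Encode a string as a bag of character counts (case-insensitive).

    Assumption: a stand-in for a learned character-level encoder.
    """
    return Counter(text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    # Only characters present in both bags contribute to the dot product.
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty string has no direction; treat as no match.
    return dot / (norm_a * norm_b)


def similarity(s1: str, s2: str) -> float:
    """Score how likely two strings refer to the same entity."""
    return cosine_similarity(char_bag(s1), char_bag(s2))


if __name__ == "__main__":
    # A misspelled name should score higher than an unrelated one.
    print(similarity("Microsoft", "Micrsoft"))
    print(similarity("Microsoft", "Oracle"))
```

A count vector like this tolerates dropped or transposed characters but ignores character order and carries no semantic information, which is the kind of limitation a learned character-level encoder is meant to address.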

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Data Quality and Management