Character-level Deep Conflation for Business Data Analytics
Zhe Gan, P. D. Singh, Ameet Joshi, Xiaodong He, Jianshu Chen, Jianfeng Gao, Li Deng

TL;DR
This paper introduces a character-level deep learning model for conflating text attributes in business data, effectively handling misspellings and semantic differences to improve data merging accuracy.
Contribution
It proposes two novel deep conflation models using LSTM and CNN architectures, advancing semantic understanding at the character level for business data analytics.
Findings
Both models outperform baseline bag-of-character approaches.
LSTM-based model achieves higher accuracy in conflation tasks.
Models effectively handle misspellings and semantic variations.
Abstract
Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity. However, the conflation task is challenging because two text strings that describe the same entity could be quite different from each other for reasons such as misspelling. It is therefore critical to develop a conflation model that is able to truly understand the semantic meaning of the strings and match them at the semantic level. To this end, we develop a character-level deep conflation model that encodes the input text strings from character level into finite dimension feature vectors, which are then used to compute the cosine similarity between the text strings. The model is trained in an end-to-end manner using back propagation and stochastic…
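The pipeline the abstract describes (characters in, a fixed-dimension feature vector out, cosine similarity between two vectors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the character set, window width, output dimension, and the names `encode` and `cosine` are all assumptions, and the convolution filters are random and untrained, so the scores only demonstrate the data flow. A trained model would learn the filters end to end.

```python
import math
import random

# Assumed character set; every other character shares one "unknown" slot.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}
UNK_ID = len(ALPHABET)      # any character outside ALPHABET
PAD_ID = len(ALPHABET) + 1  # padding at both ends of the string
VOCAB = len(ALPHABET) + 2

WIDTH = 3   # assumed convolution window, in characters
DIM = 16    # assumed size of the output feature vector

_rng = random.Random(0)
# FILTERS[d][k][v] is the weight of filter d for character id v at
# window position k. Random values stand in for learned parameters.
FILTERS = [
    [[_rng.uniform(-1.0, 1.0) for _ in range(VOCAB)] for _ in range(WIDTH)]
    for _ in range(DIM)
]


def encode(text):
    """Map a string of any length to a DIM-dimensional feature vector."""
    ids = [CHAR_TO_ID.get(c, UNK_ID) for c in text.lower()]
    pad = [PAD_ID] * (WIDTH - 1)
    ids = pad + ids + pad  # guarantees at least one window

    features = []
    for filt in FILTERS:
        # With one-hot character inputs, a 1-D convolution reduces to
        # looking up and summing one weight per window position.
        responses = [
            sum(filt[k][ids[start + k]] for k in range(WIDTH))
            for start in range(len(ids) - WIDTH + 1)
        ]
        # Max-pooling over positions gives a length-independent value.
        features.append(math.tanh(max(responses)))
    return features


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = encode("Microsoft Corporation")
    b = encode("Micorsoft Corp")
    print(len(a), len(b), cosine(a, b))
```

Max-pooling over window positions is what makes the output size independent of string length, which is the property needed before two strings of different lengths can be compared with a single cosine score.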
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Data Quality and Management
