Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
Ravi Shankar Mishra, Kartik Mehta, Nikhil Rasiwasia

TL;DR
SANTA is a scalable framework that combines optimized syntactic matching with self-supervised token embeddings to improve normalization of e-commerce attribute values, outperforming traditional methods.
Contribution
The paper introduces a novel self-supervised embedding learning approach using a twin network with triplet loss for attribute normalization, surpassing existing string similarity and unsupervised embedding methods.
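As a rough illustration of the twin-network-with-triplet-loss idea, the sketch below passes an anchor, a positive, and a negative through one shared encoder and applies a margin-based triplet loss. The averaging encoder, the Euclidean distance, the margin of 1.0, and the toy embedding values are all assumptions made for this example; the summary does not state which choices the paper makes.

```python
from math import sqrt


def encode(tokens, table, dim=2):
    """Shared ("twin") encoder: average the token embeddings.

    Hypothetical stand-in for a learned network; unknown tokens map to zeros.
    """
    vecs = [table.get(t, [0.0] * dim) for t in tokens]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)


# Toy embedding table (made-up values, for illustration only).
table = {"720p": [1.0, 0.0], "hd": [1.0, 0.0], "4k": [0.0, 1.0]}

anchor = encode(["720p"], table)
positive = encode(["hd"], table)    # same canonical value as the anchor
negative = encode(["4k"], table)    # different canonical value

loss = triplet_loss(anchor, positive, negative)
```

Because the same `encode` function and table are applied to all three inputs, any gradient update to the embeddings would affect anchor, positive, and negative alike, which is what makes the network "twin".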
Findings
Among nine syntactic matching algorithms studied, cosine similarity performs best, improving on the commonly used Jaccard index by 2.7%.
Self-supervised token embeddings improve normalization accuracy.
The proposed method achieves 19.3% better results than unsupervised embeddings.
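To make the comparison between cosine similarity and the Jaccard index concrete, here is a minimal sketch of syntactic matching against a fixed list of canonical values. The character-trigram tokenizer, the raw-count weighting, and the helper names (`char_ngrams`, `normalize`) are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Lowercase the text and split it into character n-grams (assumed tokenizer)."""
    s = text.lower().strip()
    if len(s) < n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(a, b):
    """Jaccard index over the sets of n-grams: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(char_ngrams(a)), set(char_ngrams(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity over n-gram count vectors."""
    ca, cb = Counter(char_ngrams(a)), Counter(char_ngrams(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def normalize(value, canonical_values, sim=cosine):
    """Map a raw attribute value to the most similar canonical value."""
    return max(canonical_values, key=lambda c: sim(value, c))


canonical = ["Windows 10", "Windows 7", "Linux"]
best = normalize("Win 10 Pro", canonical)

# Purely syntactic scores cannot relate synonyms that share no characters.
synonym_score = cosine("720p", "HD")
```

In this toy setup "Win 10 Pro" is matched to "Windows 10" through shared trigrams, while "720p" and "HD" share none and score zero, which is the kind of case that motivates going beyond string similarity.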
Abstract
In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization, as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to…
