Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Ravi Shankar Mishra; Kartik Mehta; Nikhil Rasiwasia

arXiv:2106.09493 · cs.CL · June 18, 2021

TL;DR

SANTA is a scalable framework for normalizing e-commerce attribute values. It combines optimized syntactic matching with self-supervised token embeddings and outperforms both string similarity and unsupervised embedding baselines.

Contribution

The paper introduces a novel self-supervised embedding learning approach using a twin network with triplet loss for attribute normalization, surpassing existing string similarity and unsupervised embedding methods.
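To make the triplet objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cosine distance, the margin of 0.5, the mean-of-tokens `encode` helper, and the toy two-dimensional embedding table are all assumptions chosen for illustration. The "twin" aspect is represented only by the fact that the same embedding table encodes the anchor, the positive, and the negative.

```python
import math

def encode(tokens, table):
    """Hypothetical encoder: average the token vectors from one shared table."""
    vectors = [table[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy embedding table (made-up numbers, not learned values).
table = {
    "720p": [1.0, 0.1],
    "hd": [0.9, 0.2],
    "4k": [-0.2, 1.0],
}

anchor = encode(["720p"], table)
positive = encode(["hd"], table)    # should end up close to the anchor
negative = encode(["4k"], table)    # should end up far from the anchor

well_separated = triplet_loss(anchor, positive, negative)
violated = triplet_loss(anchor, negative, positive)
```

With these toy vectors the first triplet already satisfies the margin, so its loss is zero, while the swapped triplet incurs a positive loss. During training, gradients from such violated triplets would pull the anchor toward the positive and push it away from the negative.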

Findings

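01. Among the nine syntactic matching algorithms studied, cosine similarity performs best, improving on the commonly used Jaccard index by 2.7%.

02. Self-supervised token embeddings improve normalization accuracy over string similarity alone.

03. The proposed method improves on unsupervised embeddings by 19.3%.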

Abstract

In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization, as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to…
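The syntactic matching step described in the abstract can be sketched as follows. This is an illustrative example only: the page does not say how the paper tokenizes strings, so character trigrams, lowercasing, and the helper names `char_ngrams` and `normalize` are assumptions. The sketch maps a raw value to the most similar canonical value and allows cosine similarity and the Jaccard index to be swapped in.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased string (assumed tokenization)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def jaccard_index(a, b):
    """Jaccard index over the sets of distinct n-grams."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def normalize(value, canonical_values, similarity=cosine_similarity):
    """Return the canonical value most similar to the raw attribute value."""
    grams = char_ngrams(value)
    return max(canonical_values, key=lambda c: similarity(grams, char_ngrams(c)))

canonical = ["Windows 10", "Windows 7", "Linux"]
by_cosine = normalize("Win 10 Pro", canonical)
by_jaccard = normalize("Win 10 Pro", canonical, similarity=jaccard_index)

# "720p" and "HD" share no character trigrams, so any purely syntactic
# score between them is zero even though they are synonyms.
hd_overlap = cosine_similarity(char_ngrams("720p"), char_ngrams("HD"))
```

In this toy case both measures pick "Windows 10", so the sketch does not reproduce the reported gap between them. The last line shows the limitation the abstract points to: surface forms such as "720p" and "HD" have no string overlap, which is what motivates learned embeddings.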
