A Clustering-Based Combinatorial Approach to Unsupervised Matching of Product Titles
Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis, Christos Makris

TL;DR
This paper presents UPM, an unsupervised, parameter-free clustering algorithm for matching product titles in e-commerce, which outperforms existing methods by analyzing word combinations without external data or pairwise comparisons.
Contribution
The paper introduces UPM, a novel unsupervised and parameter-free clustering approach that effectively matches products based on titles without external data or pairwise comparisons.
Findings
UPM outperforms state-of-the-art methods in efficiency.
UPM achieves higher accuracy in product matching.
The approach is independent of external data sources.
Table 1: Token semantics and their identification rules.

| Type | Semantics | Identification rule(s) |
|---|---|---|
| 1 | Attribute | i) Numeric tokens followed by measurement units, or ii) mixed tokens ending in a measurement unit |
| 2 | Model | The first mixed token in the title which does not represent an attribute |
| 3 | Model | All remaining mixed tokens in the title which do not represent an attribute |
| 4 | Model | A numeric token which is not followed by a measurement unit |
| 5 | Normal | All other tokens of the title |

| Dataset | Titles | | | |
|---|---|---|---|---|
| CPUs | 37 | 1901 | 3862 | 11.285 |
| Digital Cameras | 103 | 836 | 2697 | 9.605 |
| Dishwashers | 94 | 1678 | 3424 | 6.819 |
| Microwaves | 114 | 1039 | 2342 | 7.591 |
| Mobile Phones | 84 | 1837 | 4081 | 8.416 |
| Refrigerators | 118 | 5172 | 11291 | 7.847 |
| TVs | 129 | 1678 | 3564 | 10.263 |
| Washing Machines | 87 | 1703 | 4044 | 7.931 |
| PriceRunner Aggregate | 306 | 15844 | 35305 | 8.560 |
| Air Conditioners | 216 | 1442 | 13595 | 10.497 |
| Car Batteries | 66 | 2097 | 5864 | 8.073 |
| Cookers & Ovens | 163 | 1355 | 10858 | 6.455 |
| CPUs | 92 | 356 | 1906 | 9.115 |
| Digital Cameras | 152 | 973 | 4111 | 8.802 |
| Refrigerators | 161 | 1697 | 16177 | 5.955 |
| TVs | 205 | 1246 | 7002 | 7.382 |
| Watches | 212 | 60559 | 178657 | 6.517 |
| Skroutz Aggregate | 652 | 68512 | 238170 | 6.827 |

| Method | PriceRunner Aggregate: Time (sec) | PriceRunner Aggregate: Gain (x) | Skroutz Aggregate: Time (sec) | Skroutz Aggregate: Gain (x) |
|---|---|---|---|---|
| UPM+ | 37 | - | 1638 | - |
| UPM | 53 | 1.43 | 1744 | 1.06 |
| | 430 | 11.62 | 14156 | 8.64 |
| | 1065 | 28.78 | 40176 | 24.52 |
| | 761 | 20.56 | 23257 | 14.20 |
| | 1387 | 37.48 | 46442 | 28.35 |
Abstract
The constant growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities to the Web, the volume and the diversity of the product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Recent studies showed that the standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employed external data sources (search engines) to enrich the titles; these solutions are rather impractical, mainly because fetching the external data is slow. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles. UPM is independent of any external sources, since it analyzes the titles and extracts combinations of words out of them. These combinations are evaluated according to several criteria, and the most appropriate of them constitutes the cluster into which a product is classified. UPM is also parameter-free, it avoids pairwise product comparisons, and it includes a post-processing verification stage which corrects erroneous matches. The experimental evaluation of UPM demonstrated its superiority over the state-of-the-art approaches in terms of both efficiency and effectiveness.
Keywords: product matching · entity matching · entity resolution · clustering · unsupervised learning · machine learning · data mining
1 Introduction
The online comparison of products is a crucial process, since it is usually the first step in the life cycle of an electronic sale. Before a purchase is completed, the majority of users search, collect and aggregate the characteristics of both the desired product and any similar ones. For this reason, the role of product comparison services has become increasingly important. These platforms retrieve data from various sources, including electronic stores, suppliers and review sites, and they merge the information which refers to identical products. Subsequently, they present this information to their users, allowing them to compare a variety of parameters such as features and prices. They also facilitate the aggregation of user opinions and reviews.
Since the product-related data originates from multiple sources, it presents a high degree of diversity. To implement their comparison tools, the aggregation platforms must develop algorithms which identify identical products. Clearly, the problem of product matching is vital for these platforms, their users, and the e-commerce industry in general.
Due to its importance, there exists a significant amount of research on this interesting problem. The relevant literature includes solutions which can be divided into two categories: The first one contains works which address the problem by examining solely the product titles. Earlier studies employed standard string similarity methods including the cosine and edit-distance measures sigmod2003 ; tkde2007 ; icde2007 ; irintro2008 ; tkdd2008 ; vldb2011 ; lcs2012 ; sigmod2013 ; ijca2013 . However, cikm2012 showed that these metrics are inadequate on this particular problem; frequently, identical products are described by very diverse titles, whereas highly similar titles do not necessarily represent identical products.
For this reason, the method of cikm2012 employs Web search engines with the aim of enriching the product titles with important missing words. For each title, the algorithm submits a query to a Web search engine and subsequently collects and processes the returned results to identify such words. A similar approach is introduced in vldb2014 , where the titles are modeled as graphs and a clustering algorithm determines whether these graphs form a single cohesive community or separately clustered communities. However, the submission of a query to a search engine and the subsequent processing of the returned results are expensive operations. Additionally, the provided APIs do not allow unlimited usage, and there is a limit to the number of queries which can be submitted on a daily basis. These limitations render these two approaches inapplicable to large datasets with millions of products.
The second category includes methods which take into consideration additional features such as brands, manufacturers, categories, etc. More specifically, FEBRL provides an implementation based on SVMs for learning suitable matcher combinations febrl2008 , and MARLIN offers a set of several learning methods such as SVMs and decision trees, combined with two similarity measures sigkdd2003 . Nonetheless, these methods exhibit one significant problem: since an aggregation service is fed with data from multiple uncontrolled sources, many of the product attributes which are present in one feed may be absent in another. Even if an attribute is provided by all sources, the data is frequently skewed or incomplete. In such cases, it is inevitable that the methods of this category will not perform well.
In this paper we present UPM (Unsupervised Product Matcher), an unsupervised algorithm for matching products by their titles. The following list contains a brief description of the parts of the algorithm and summarizes the contributions of this work:
- UPM is based on the concept of unsupervised entity resolution via clustering. In detail, it constructs combinations of the words of the titles and assigns a score to each one of them, similarly to inista2018 . The highest-scoring combination (called a cluster) is the one which best represents the identity of a product. All the products within the same cluster are considered to match each other.
- It performs morphological analysis of the product titles and identifies potentially useful tokens (attributes, models, etc.). Each title is then split into virtual fields, and the tokens are distributed to these fields according to their form and semantics.
- It assigns scores to these fields and subsequently plugs these scores into a function which evaluates the combinations. This function also takes into consideration additional properties of a combination, including its position in a title and its frequency.
- It includes a post-processing verification stage which is executed after the formation of the clusters. Based on the observation that a product very rarely appears twice within the catalog of a vendor, this stage either moves products from one cluster to another, or it creates new clusters. This stage leads to significant gains in the matching performance of the algorithm.
- Unlike the aforementioned methods, our algorithm does not perform pairwise comparisons between the products to determine whether they match or not. Therefore, it avoids the quadratic complexity of this procedure, and it does not require the invention of an additional blocking policy.
- The following presentation introduces several parameters for UPM. However, there are global settings for these parameters which consistently lead to satisfactory performance. Fixing these values ultimately leads to a method which is parameter-free.
The rest of the paper is organized as follows: Section 2 consists of six subsections which describe the core parts of the algorithm. In particular, the first five present the primary data structures and their construction method, the combinations scoring function, the cluster selection strategy and the verification stage of UPM. Subsection 2.6 is dedicated to the fixing of the various parameters. The experimental evaluation of the algorithm is conducted in Section 3 and the final conclusions are summarized in Section 4. For research purposes, both the code we developed and the datasets we utilized have been made publicly available on GitHub.
2 Unsupervised Product Matching
Let us consider a set of vendors which includes electronic stores, suppliers, auction platforms and so on. Each vendor distributes an electronic catalog which contains the products it provides, accompanied by some additional useful information. In case this information is organized in a structured (or semi-structured) form, the catalog is called a feed and the products are stored as a collection of successive records. Each record comprises an arbitrary number of attributes, including its title, brand, model, and others.
Moreover, a vendor creates its feed independently of the others; hence, one vendor may provide information about the brand or the category of a product, whereas another may not. Even if both vendors include this information in their feeds, there may be discrepancies which inevitably lead to skewed data.
Nevertheless, all feeds must contain at least one descriptive title for each included product. Two or more vendors may use diverse titles to describe the same product. In the following subsections we describe an unsupervised algorithm which matches products by overcoming this diversity.
2.1 Combinations vs. n-grams
The string of a product title usually consists of multiple types of substrings, including words, model descriptions, technical specifications, etc. We collectively refer to all these substrings as tokens. Let $W_t$ be the set of all tokens of a product title $t$. Then, a $k$-combination $c_k$ is defined as any subset of $W_t$ of size $k$, without repetition and without regard to the ordering of the tokens. For example, if $W_t$ consists of three tokens $\{w_1, w_2, w_3\}$, then there are three possible 2-combinations, $\{w_1, w_2\}$, $\{w_1, w_3\}$ and $\{w_2, w_3\}$, and one 3-combination, $\{w_1, w_2, w_3\}$. In case $t$ consists of $l_t$ tokens (i.e., its length is $l_t$), then the number of all possible $k$-combinations is equal to the binomial coefficient:

$$\binom{l_t}{k} = \frac{l_t!}{k!\,(l_t - k)!} \tag{1}$$

and the cost of constructing the combinations of all sizes is exponential in $l_t$.
Notice that $k$-combinations are different from $n$-grams: the latter are computed by sliding a window of length $n$ over the examined string from left to right; therefore, $n$-grams capture only successive tokens. However, in a product title the important tokens (brand, model, etc.) are usually scattered across the string, in non-adjacent positions. Although the construction of $k$-combinations is more expensive, they were preferred over $n$-grams because of their ability to bring non-adjacent tokens together.
For example, there is no common 2-gram or 3-gram for the titles nVidia GeForce GTX1050 4GB and GeForce 4GB GTX1050. Hence, $n$-grams cannot identify the similarity between these two products. On the other hand, the two titles share three 2-combinations, namely GeForce GTX1050, GeForce 4GB and GTX1050 4GB. Clearly, $k$-combinations are better suited than $n$-grams to this particular problem.
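The example above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `ngrams` and `k_combinations` are hypothetical, and the titles are lower-cased and split on whitespace for simplicity.

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous windows of n tokens; the token order is preserved."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def k_combinations(tokens, k):
    """Unordered k-subsets of the distinct tokens of a title."""
    return {frozenset(c) for c in combinations(sorted(set(tokens)), k)}

t1 = "nvidia geforce gtx1050 4gb".split()
t2 = "geforce 4gb gtx1050".split()

common_2grams = ngrams(t1, 2) & ngrams(t2, 2)
common_3grams = ngrams(t1, 3) & ngrams(t2, 3)
common_2combs = k_combinations(t1, 2) & k_combinations(t2, 2)

print(common_2grams, common_3grams)            # both sets are empty
print(sorted(sorted(c) for c in common_2combs))
```

Representing each combination as a `frozenset` makes the comparison insensitive to token order, which is exactly the property that n-grams lack.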
Since the construction of all $k$-combinations is of exponential complexity, their number must be kept to a minimum. Fortunately, our experiments showed that titles contain on average 6-11 tokens depending on the category of the product, and that only a portion of them is important for the identification of a product. For this reason, we limit the computations to the first 2-, 3-, ..., $K$-combinations of the tokens of the involved titles, where $K$ denotes the largest combination size considered.
Eventually, for a title $t$ which consists of $l_t$ tokens, the total number of combinations to be computed is:

$$N_t = \sum_{k=2}^{K} \binom{l_t}{k} \tag{2}$$
For the sake of simplicity, in the presentation which follows we use the term “combination” instead of $k$-combination, and the simplified notation $c$ instead of $c_k$.
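To get a feeling for how the number of combinations grows with the title length, the following sketch sums the binomial coefficients for combination sizes from 2 up to a cap. The function name `total_combinations` and the parameter `max_k` are hypothetical names chosen for this illustration.

```python
from math import comb

def total_combinations(title_length, max_k):
    """Number of 2-, 3-, ..., max_k-combinations of a title with title_length tokens."""
    return sum(comb(title_length, k) for k in range(2, max_k + 1))

# An 8-token title: capping the size at 3 versus allowing every size.
print(total_combinations(8, 3))   # 28 + 56 = 84
print(total_combinations(8, 8))   # 2**8 - 1 - 8 = 247
```

The gap between the two printed values widens quickly for longer titles, which is why the cap on the combination size matters.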
2.2 Morphological Analysis & Token Semantics
Each title consists of tokens which are not equally important for the description of a product. Vendors may provide irrelevant information in a title, including payment facilities, special discounts, offers, shipping and delivery data, availabilities and so on. Such kinds of information are considered as noise; consequently, they may degrade the effectiveness of an entity matching algorithm.
The unsupervised extraction of the hot tokens (i.e., the highly informative ones) from a title is a particularly challenging task, since vendors use different syntactical rules to express the information of their products and each product type has its own peculiarities. Nevertheless, in this paper we perform morphological analysis of the titles with the aim of identifying these hot tokens. In particular, we initially examine the form of the tokens and we categorize each one of them as:
- Mixed, in case it contains both digits and letters, or
- Numeric, in case it contains only digits (possibly with a thousands or a decimal separator), or
- Alphabetic, in all other cases.
Next, we identify the following important pieces of information:
1) Product Attributes: The attributes of a product are important, since they can be used to differentiate it from another similar product. For instance, the 32 GB version of a cell phone is a different product compared to the 64 GB version of the same model. This is valid for multiple product types (e.g. hardware, electrical and electronic devices, etc.).
The process is based on a small lexicon of measurement units (e.g. bytes, hz, bps, meters, etc.) and of their multiples and sub-multiples. By employing this lexicon, an attribute is identified either i) when a pair of a numeric token and a measurement unit is encountered (e.g. 32 GB), or ii) when the ending of a mixed token is a measurement unit and the part which precedes it consists of digits only (e.g. 32GB). In the former case, the two tokens of the pair are concatenated into one, with the aim of eliminating the difference from the latter case.
2) Models: The model descriptors are the most fundamental part of a product title, since they represent its identity. Unfortunately, the models may take forms which vary significantly among vendors, and moreover, a specific model may appear under different forms (e.g. PS 3 vs. PS3 vs. Playstation3, etc.). Consequently, it is particularly hard for an unsupervised technique to identify such model descriptors with absolute accuracy.
Nevertheless, the approach we present here yields significant improvements in the performance of our matching algorithm. We consider a token to be a possible model descriptor if it is either mixed, or numeric and not followed by a measurement unit. In addition, not all mixed tokens are treated equally. For instance, the first mixed token in a product title is considered more likely to contain a model than the second or the third mixed token.
In case a token does not fall into one of the above categories, then it is classified as a normal token. Table 1 summarizes the five aforementioned semantics accompanied by their identification rules.
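The rules of Table 1 can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the unit lexicon `UNITS` is a hypothetical, deliberately tiny stand-in for the lexicon described above, the function names are invented, and the input is assumed to be already lower-cased and tokenized.

```python
import re

# Hypothetical, deliberately tiny unit lexicon (assumption, not the paper's list).
UNITS = {"gb", "mb", "tb", "ghz", "mhz", "hz", "kg", "cm", "mm"}

def token_form(tok):
    """Classify the form of a token as mixed, numeric or alphabetic."""
    has_digit = any(ch.isdigit() for ch in tok)
    has_alpha = any(ch.isalpha() for ch in tok)
    if has_digit and has_alpha:
        return "mixed"
    if has_digit and re.fullmatch(r"\d+([.,]\d+)*", tok):
        return "numeric"
    return "alphabetic"

def is_attribute(tok):
    """True for digits (with optional separators) followed by a known unit, e.g. 32gb."""
    m = re.fullmatch(r"(\d+(?:[.,]\d+)*)([a-z]+)", tok)
    return bool(m) and m.group(2) in UNITS

def classify(tokens):
    """Return (token, semantics) pairs, with semantics numbered 1..5 as in Table 1."""
    merged, i = [], 0
    while i < len(tokens):
        # Rule 1(i): a numeric token followed by a unit is concatenated into one token.
        if token_form(tokens[i]) == "numeric" and i + 1 < len(tokens) and tokens[i + 1] in UNITS:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    result, seen_model = [], False
    for tok in merged:
        form = token_form(tok)
        if form == "mixed" and is_attribute(tok):
            sem = 1                          # attribute
        elif form == "mixed":
            sem = 3 if seen_model else 2     # first mixed non-attribute token gets type 2
            seen_model = True
        elif form == "numeric":
            sem = 4                          # numeric token without a unit
        else:
            sem = 5                          # normal token
        result.append((tok, sem))
    return result

print(classify("samsung galaxy s10 128 gb g973f 2019".split()))
```

In this sketch the pair `128 gb` is merged into `128gb` before classification, so both spellings of an attribute end up as the same token.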
The morphological analysis of a title includes several additional steps which are performed with the aim of removing the discrepancies between tokens with the same meaning. More specifically, the product titles of the dataset are parsed sequentially and the following procedures are applied to the extracted tokens:
- Case folding: all letters are converted to lower case.
- Punctuation removal: all punctuation symbols and marks are removed from a title apart from i) dots and commas which are thousands or decimal separators, and ii) hyphens and slashes which delimit tokens. In the latter case, these tokens are appended to the title.
- Duplicate tokens removal: the existence of two or more identical tokens in a product title is rare. However, we found that their removal improves the performance of the algorithm by a significant margin.
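The three normalization steps listed above can be sketched as follows. The function name `normalize_title` is hypothetical, and the handling of hyphens and slashes reflects one possible reading of the rule, namely that the delimited parts become separate tokens of the title.

```python
import re

def normalize_title(title):
    """Case folding, punctuation removal and duplicate token removal (a sketch)."""
    text = title.lower()                                  # case folding
    text = re.sub(r"[-/]", " ", text)                     # hyphens and slashes delimit tokens
    text = re.sub(r"(?<!\d)[.,]|[.,](?!\d)", " ", text)   # keep . and , only between digits
    text = re.sub(r"[^\w\s.,]", " ", text)                # drop all remaining punctuation
    tokens = text.split()
    return list(dict.fromkeys(tokens))                    # remove duplicates, keep first occurrences

print(normalize_title("Intel Core i7-9700K, 3.6GHz (Box) - Intel!"))
```

`dict.fromkeys` preserves insertion order, so the surviving tokens keep the positions of their first occurrence, which matters for the position-based scoring that follows.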
2.3 Data Structures Construction
After the tokenization and the morphological analysis of a title have been completed, the extracted tokens and combinations are used to build the following data structures:
2.3.1 Tokens Lexicon
This is an ordinary lexicon structure $L$, which is used to store the tokens extracted from the titles of the products. For each token $w$, the tokens lexicon also maintains:
- i) a unique integer identifier (token ID),
- ii) a frequency value $f_w$ which represents the number of products that contain $w$ in their titles, and
- iii) a special variable $s_w$ which is set equal to the semantics of $w$, as indicated by the first column of Table 1.
2.3.2 Combinations Lexicon
The combinations lexicon $C$ stores the $k$-combinations ($k = 2, 3, \dots, K$, where $K$ is the largest combination size considered) of the tokens of the product titles. The representation of the stored combinations is of particular importance, since it must support not only fast searching, but also searching for combinations with different orderings of their tokens. For instance, consider the case where we extract the 3-combination CPU 3.2GHz 32MB, which does not exist in $C$. Instead, suppose that $C$ contains the 3-combination CPU 32MB 3.2GHz, which is the same as the one we search for, but with a different ordering of its tokens. In such cases, we wish to identify the equality between the two records to avoid inserting the same combination twice.
The proposed algorithm satisfies this requirement by assigning signatures to all combinations. The key concept is that a combination must have the same signature independently of the ordering of its component tokens. This means that the two 3-combinations of the previous example, i.e. CPU 32MB 3.2GHz and CPU 3.2GHz 32MB, must be assigned equal signatures. More specifically, the following procedure is applied: before a combination $c$ is inserted into $C$, its signature is computed. In case the signature is not found in $C$, then $c$ does not exist in $C$ in any form (i.e. under any ordering of its tokens) and can safely be inserted into it. Otherwise, $c$ already resides within $C$ in one form or another.
A simple method for computing the signature of a combination is via token sorting and hashing. More specifically, this method initially retrieves the IDs of the component tokens of $c$ and sorts them in increasing order. The sorted values are then concatenated and delimited by a special symbol (e.g. a single space character). The string we obtain is subsequently passed through a string hash function $h$ which is common for all combinations. The output of $h$ constitutes the desired signature of $c$.
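The sort-and-hash procedure described above can be sketched in a few lines. The function name `signature` is hypothetical, and SHA-256 is used here only as a convenient stand-in; the text does not state which hash function is employed.

```python
import hashlib

def signature(token_ids):
    """Order-independent signature of a combination, given the IDs of its tokens."""
    # Sort the integer IDs, then join them with a single space as the delimiter.
    key = " ".join(str(i) for i in sorted(token_ids))
    # SHA-256 is an assumption made for this sketch; any common string hash would do.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Hypothetical token IDs: cpu -> 7, 3.2ghz -> 3, 32mb -> 12
print(signature([7, 3, 12]) == signature([7, 12, 3]))   # same tokens, different order
print(signature([7, 3, 12]) == signature([7, 3, 13]))   # different tokens
```

Sorting the IDs as integers rather than as strings keeps the canonical form stable regardless of the number of digits in each ID.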
As we demonstrate later in the experimental evaluation, the usage of signatures leads to substantial improvements in the efficiency of the algorithm. Eventually, each combination record $c$ possesses the following properties:
- i) its signature,
- ii) a frequency value $f_c$ which represents the number of product titles which contain $c$, and
- iii) a distance accumulator which maintains the sum of the distances of $c$ from the beginning of the titles. This value will be used later to assign a score to $c$.
2.3.3 Forward Index
The forward index $F$ is essentially a list of all product records. Each product $p$ is associated with two pointer lists:
- i) the tokens forward list, which maintains pointers to the tokens of the title of $p$; and
- ii) the combinations forward list, namely, a list of pointers whose number is given by eq. 2. Each pointer refers to a combination of the tokens of the title of $p$ which is stored in the combinations lexicon.
In Figure 1 we depict the interconnection of the forward index with the tokens and the combinations lexicon structures. Notice that the existence of pointers in the forward index saves us the cost of storing the same data twice.
2.3.4 Construction Algorithm
Algorithm 1 presents the construction methodology of the aforementioned data structures. Initially, each product $p$ enters the forward index with its tokens and combinations lists empty (step 4). Subsequently, its title is parsed and its tokens are extracted. Each token $w$ passes through a filtration process where the morphological analysis of the previous subsection is performed (i.e. case folding, punctuation removal, etc.). Moreover, the semantics of $w$ is identified according to the rules of Table 1 (step 7).
After this process, a search for $w$ in the tokens lexicon is performed (step 8). Notice that the search operation returns a pointer to the corresponding token record in the lexicon. In case the search is unsuccessful, $w$ is inserted into the lexicon with a frequency of 1 and a pointer to the new record is returned; in the opposite case, its corresponding frequency value increases by 1 (steps 9–14). Finally, the pointer is inserted into the tokens list of $p$ within the forward index (step 15).
The procedure continues with the computation of all combinations of the tokens of the title and the generation of their respective signatures (steps 17–20). Then, for each combination $c$, the combinations lexicon is queried against the signature of $c$. If this search is unsuccessful, $c$ is inserted into the lexicon with a frequency of 1 and a pointer to the new record is returned; otherwise, its frequency increases by one (steps 23–30). The algorithm ends with the insertion of the pointer into the combinations list of $p$ within the forward index in step 31.
During this process, the distance of $c$ from the beginning of the product title is calculated, and it is used to update the distance accumulator of $c$. This distance value will be employed later by the combinations scoring function. We provide more details about the usage of the distance accumulator in the next subsection.
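The construction procedure can be sketched compactly in Python. This is not a transcription of Algorithm 1 but an illustration under stated assumptions: all class and function names are hypothetical, the semantics identification step is omitted, a sorted tuple of token IDs stands in for the hashed signature, and the position of a token inside a combination is taken to be its index in title order.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt

@dataclass
class TokenRecord:
    token_id: int
    frequency: int = 0

@dataclass
class CombinationRecord:
    frequency: int = 0
    distance_acc: float = 0.0   # sum of the distances from the beginning of the titles

def build_structures(titles, max_k=3):
    tokens_lexicon = {}   # token string -> TokenRecord
    combos_lexicon = {}   # signature    -> CombinationRecord
    forward_index = []    # one entry per product title
    for title in titles:
        tokens = list(dict.fromkeys(title.lower().split()))
        ids = []
        for tok in tokens:
            rec = tokens_lexicon.get(tok)
            if rec is None:   # unsuccessful search: insert a new token record
                rec = tokens_lexicon[tok] = TokenRecord(token_id=len(tokens_lexicon))
            rec.frequency += 1
            ids.append(rec.token_id)
        entry = {"tokens": ids, "combinations": []}
        for k in range(2, max_k + 1):
            for positions in combinations(range(len(ids)), k):
                sig = tuple(sorted(ids[p] for p in positions))   # order-independent key
                rec = combos_lexicon.setdefault(sig, CombinationRecord())
                rec.frequency += 1
                # Euclidean distance between positions in the title and in the combination.
                rec.distance_acc += sqrt(sum((p - i) ** 2 for i, p in enumerate(positions)))
                entry["combinations"].append(sig)
        forward_index.append(entry)
    return tokens_lexicon, combos_lexicon, forward_index

tokens, combos, index = build_structures(["CPU 3.2GHz 32MB", "CPU 32MB 3.2GHz"])
print(len(combos), combos[(0, 1, 2)].frequency)
```

Because both titles contain the same three tokens, the two orderings collapse into the same four lexicon entries, each with a frequency of 2.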
2.4 Scores Computation & Cluster Selection
In summary, the purpose of this phase is to compute an importance score for each combination $c$ of each product $p$ of the forward index. The highest-scoring combination is then declared the dominating cluster to which $p$ is mapped. All the other products which are also mapped to this cluster are considered to match $p$. Finally, the clusters of all products are utilized to build the clusters universe $U$, which is used in the subsequent stages.
We now elaborate on the form of the combinations score function. Initially, we study the properties that a combination must possess to be declared as a dominating cluster, and then we proceed to the quantification of these properties.
- Frequency: The number of products which contain a combination is an important parameter, since the more frequent a combination is, the more products will be mapped to it. In contrast, if we select a rare combination, we shall not be able to map any other product to it.
- Length: The frequency criterion definitely favors the short combinations, because it is more likely to encounter a 2-combination which is common to multiple products than a 3-combination. However, the short combinations are not as descriptive as the longer ones, and there is a risk of creating very inhomogeneous clusters which may erroneously contain different products.
- Position: A broadly accepted idea in information retrieval dictates that the most important words of a document usually appear early, that is, at a small distance from its beginning.
- Hot tokens: A combination which contains multiple highly informative tokens represents the identity of the product more accurately than one which does not include such tokens.
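Given a title $t$, a combination $c$ of $t$, and a token $w \in c$, let $p_{w,t}$ be the position (or offset) of $w$ in $t$ and $p_{w,c}$ the position of $w$ in $c$. By using this notation, the distance between $c$ and $t$ is computed by employing the well-established Euclidean distance for strings:

$$d(c, t) = \sqrt{\sum_{w \in c} \left(p_{w,t} - p_{w,c}\right)^2} \tag{3}$$

Based on this equation we compute the average distance of $c$ from the beginning of all titles as follows:

$$\bar{d}(c) = \frac{1}{f_c} \sum_{t \,:\, c \subseteq W_t} d(c, t) \tag{4}$$

where $W_t$ is the set of tokens of $t$, $f_c$ is the number of titles which contain $c$, and the sum is the value kept by the distance accumulator of $c$.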
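As a small worked example of the two distance quantities, the sketch below computes the distance of a combination from the beginning of a title and its average over the titles which contain it. The function names are hypothetical, and the position of a token in a combination is assumed to be its index in the order in which the combination is written.

```python
from math import sqrt

def combination_distance(title_tokens, combo_tokens):
    """Euclidean distance between token positions in the title and in the combination."""
    return sqrt(sum((title_tokens.index(w) - i) ** 2 for i, w in enumerate(combo_tokens)))

def average_distance(titles, combo_tokens):
    """Mean distance of the combination over the titles which contain all of its tokens."""
    containing = [t for t in titles if all(w in t for w in combo_tokens)]
    return sum(combination_distance(t, combo_tokens) for t in containing) / len(containing)

t1 = ["nvidia", "geforce", "gtx1050", "4gb"]
t2 = ["geforce", "4gb", "gtx1050"]
combo = ["geforce", "gtx1050"]

print(combination_distance(t1, combo))    # positions (1, 2) vs (0, 1): sqrt(2)
print(combination_distance(t2, combo))    # positions (0, 2) vs (0, 1): 1.0
print(average_distance([t1, t2], combo))  # (sqrt(2) + 1) / 2
```

A combination made of the first tokens of a title, such as `["nvidia", "geforce"]` in `t1`, has a distance of zero, which is the case the scoring constant has to guard against.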
The four aforementioned properties of a combination can now be quantified by the following scoring function:
[TABLE]
The scoring function includes a constant which i) prevents the score from becoming infinite when the average distance is zero (i.e., when the combination always appears at the beginning of all titles), and ii) determines the importance of proximity in the overall score of a combination.
Another factor of the scoring function constitutes the IR score of the combination, and it is built by adopting the spirit of the BM25F scoring method for structured documents iexml2005 . This scheme is designed to boost the scores of the words which appear in highly important places of a document (called fields), such as its title.
Although a product title is clearly a short unstructured text, here we introduce the idea of splitting a title into virtual fields, based on the aforementioned semantics of each token. According to this approach, a title is divided into five virtual fields, one for each semantics type of Table 1. Each field is allowed to contain only tokens which have identical semantics. For example, according to Table 1, the first field accommodates only the tokens which represent the attributes of a product, whereas the model fields list the tokens which potentially carry information about the model. Notice that a field may be completely empty, whereas a token can belong to only one field.
Similarly to BM25F, the factor is computed by applying the following equation:
[TABLE]
In this equation, each token of the combination contributes the weight of the field which contains it; notice the dependence of this weight on the semantics value of the token. Each token also contributes its inverse document frequency, which is derived from its frequency and the total number of product titles. The equation further involves the length of the combination and the average length (in number of tokens) of all combinations in the dataset. Finally, it includes a constant whose value falls into a fixed range.
In conclusion, eq. 5 indicates that a product should be clustered under a combination which: i) is frequent, ii) is reasonably long, iii) usually occurs near the beginning of the titles and iv) contains multiple important tokens. Subsequently, we employ it to identify the most appropriate cluster for every product of the forward index.
Algorithm 2 contains the details of this procedure. Notice that since we are only interested in the highest-scoring combination, it is not mandatory to store the scores of all combinations in some dedicated data structure (e.g. heap); a simple computation of the maximum score suffices.
Initially, an empty clusters universe $U$ is initialized. Subsequently, we iterate through the products of the forward index and for each product $p$ we traverse its combinations forward list. For each combination $c$, the field lengths are stored within an array, according to the semantics of the tokens of $c$ (steps 6–8). In steps 9–13 the IR score of eq. 6 is calculated, whereas the next step computes the average distance of $c$. Having prepared this data, the score of $c$ is obtained in step 15. In steps 16–19 we conditionally update the maximum score and the highest-scoring combination.
The combination with the maximum score is subsequently selected as the dominating cluster, or simply the cluster, of $p$. It is then inserted into the global set $U$, along with the corresponding product $p$, according to steps 2–6 of Algorithm 3. Notice that the insertion includes additional operations after step 6, which are described in detail in the next subsection. Finally, the algorithm deallocates the resources occupied by data which are not useful for the next step, including the combinations which have not been declared clusters.
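The selection loop described above can be sketched independently of the exact scoring formula. In the illustration below, `select_clusters` is a hypothetical name, the forward index is simplified to a dictionary from products to their combinations, and the scoring function is passed in as a parameter; the toy score used in the example (frequency times length) is an assumption made purely for demonstration and is not the scoring function of eq. 5.

```python
from collections import defaultdict

def select_clusters(forward_index, score):
    """Map each product to its highest-scoring combination (its dominating cluster)."""
    universe = defaultdict(list)   # cluster (combination) -> list of products
    for product, combos in forward_index.items():
        best, best_score = None, float("-inf")
        for c in combos:
            s = score(c)
            if s > best_score:     # keep only the running maximum; no heap is needed
                best, best_score = c, s
        if best is not None:
            universe[best].append(product)
    return dict(universe)

# Toy data: combinations are sorted tuples of tokens; frequencies are made up.
frequency = {("geforce", "gtx1050"): 2, ("geforce", "nvidia"): 1, ("4gb", "geforce"): 1}
forward_index = {
    "p1": [("geforce", "nvidia"), ("geforce", "gtx1050")],
    "p2": [("4gb", "geforce"), ("geforce", "gtx1050")],
}
clusters = select_clusters(forward_index, score=lambda c: frequency[c] * len(c))
print(clusters)
```

Since only the maximum is needed, a single pass with two running variables suffices, and the scores of the losing combinations never have to be stored.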
2.5 Verification Stage & Cluster Correction
The procedures of the previous subsections achieve their goal, that is, unsupervised product matching by using only their titles. However, there is still room for improvement.
Here we present a post-processing verification step which attempts to recognize false matches. In the absence of training data, it is based on a simple, but strong hypothesis: In the vast majority of cases, each product appears only once in the feed of the same vendor, or equivalently, a vendor does not include identical products in his/her catalog. Of course, there are some individual cases where the same product indeed exists multiple times within a catalog of a vendor. However, such cases are extremely rare and they usually occur by mistake.
This hypothesis, combined with the fact that a cluster contains products which are considered to match each other (i.e. they are identical), leads to the following lemma:
Lemma 1
A cluster $c$ cannot contain two or more products from the same vendor $v$.
Proof
Suppose that $c$ contains two products $p_1$ and $p_2$ from the same vendor $v$. Since $c$ contains only products which match each other, $p_1$ is identical to $p_2$. But then $v$ included the same product multiple times in its catalog, a statement which contradicts our hypothesis.
This lemma drives the entire verification stage. Based on it, we say that a vendor $v$ is a violator of a cluster $c$ if $c$ contains two or more of the products of $v$. In this case, $c$ is an invalid cluster and it requires a special validation process to be applied to it. In short, this process: i) allows only one product of $v$ in $c$, and ii) evicts the remaining products of $v$ from $c$. The evicted products can either: i) migrate to another existing cluster according to some criteria, or ii) be transferred to a new cluster.
Recall that technically, a cluster is merely a combination object and as such, it possesses the properties of Subsection 2.3.2. To support the verification stage, a cluster must be extended with the following elements:
- A list with the vendors of the products of ,
- One of the products of is selected as the representative product of , according to a score. The title of is used as a label for and thus, cannot leave .
- One list per vendor which stores the products that both belong to and are provided by . Each product is assigned two scores: i) will be used to select the representative product , and ii) which stores the similarity of with .
These elements are computed immediately during the insertion of and into (step 21 of Algorithm 2). The steps 7–14 of Algorithm 3 describe this process: initially, the vendor of is inserted into the list (provided that ). In the sequel, the score is computed, and in case it exceeds the maximum product score in the cluster, then is declared as the representative product of the cluster.
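The extended cluster and its insertion routine might look like the sketch below. All class and field names are hypothetical, and the product score is passed in as a plain number because its formula is not reproduced in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Product:
    title: str
    vendor: str
    score: float = 0.0       # used to choose the representative (formula assumed given)
    similarity: float = 0.0  # similarity with the representative, filled in later

@dataclass
class Cluster:
    label: str
    vendors: List[str] = field(default_factory=list)
    products_by_vendor: Dict[str, List[Product]] = field(default_factory=dict)
    representative: Optional[Product] = None

    def insert(self, product: Product) -> None:
        # Register the vendor only the first time it appears in this cluster.
        if product.vendor not in self.products_by_vendor:
            self.vendors.append(product.vendor)
            self.products_by_vendor[product.vendor] = []
        self.products_by_vendor[product.vendor].append(product)
        # The product with the strictly highest score becomes the representative.
        if self.representative is None or product.score > self.representative.score:
            self.representative = product

c = Cluster(label="acme x100")
c.insert(Product("acme x100 64gb", "shopA", score=0.4))
c.insert(Product("acme x100 64gb black", "shopB", score=0.7))
c.insert(Product("acme x100 128gb", "shopA", score=0.5))
```

Keeping one list per vendor makes the later violator test a simple length check on each list.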
After the required data has been prepared, the verification stage of Algorithm 4 is executed. For each cluster we traverse its list of vendors and in case a violator is found (i.e., ), we identify which product of will stay in . This is achieved by calculating the similarity score of each product with the representative product , and by sorting in decreasing similarity score order (steps 3–7). The first record of the list, namely, the most similar product to , is selected to remain in ; the remaining products will eventually leave .
There exist two options to handle the evicted products. The former is applied when there exists another cluster whose representative product is highly similar to an evicted product . In that case, migrates to , provided that does not contain any other product of and it will not become invalid after the insertion of . If no cluster of satisfies these criteria, then the latter option dictates that we create a new cluster , append to the universe , and finally, transfer to (steps 8–17).
The final point which needs to be clarified is the method for retrieving the clusters which are both valid and relevant to an evicted product of a vendor (step 10). The strategy we adopted was to compute the cosine similarity of with the representative product of each candidate cluster which did not contain any other products of . In case the maximum computed similarity is above a predefined threshold , then is inserted into the corresponding cluster. Otherwise, a new cluster is created and is transferred there.
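The eviction and migration logic described above can be sketched as follows. This is a simplified illustration and not the paper's Algorithm 4: clusters are plain dictionaries, cosine similarity is computed over whitespace tokens, evictions are collected in a first pass and resolved in a second one, and the threshold value is chosen arbitrarily for the example. All names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two titles over whitespace-token counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def verify(clusters, threshold):
    """clusters: list of {"rep": title, "items": [(vendor, title), ...]}; mutated in place."""
    evicted = []
    for c in clusters:
        by_vendor = {}
        for vendor, title in c["items"]:
            by_vendor.setdefault(vendor, []).append(title)
        kept = []
        for vendor, titles in by_vendor.items():
            # Keep the product most similar to the representative; evict the others.
            titles.sort(key=lambda t: cosine(t, c["rep"]), reverse=True)
            kept.append((vendor, titles[0]))
            evicted.extend((vendor, t) for t in titles[1:])
        c["items"] = kept
    for vendor, title in evicted:
        # Only clusters without a product of the same vendor are valid targets.
        candidates = [c for c in clusters if all(v != vendor for v, _ in c["items"])]
        best = max(candidates, key=lambda c: cosine(title, c["rep"]), default=None)
        if best is not None and cosine(title, best["rep"]) > threshold:
            best["items"].append((vendor, title))
        else:
            clusters.append({"rep": title, "items": [(vendor, title)]})
    return clusters

clusters = [
    {"rep": "acme x100 64gb", "items": [("A", "acme x100 64gb"),
                                         ("B", "acme x100 64gb black"),
                                         ("A", "acme x200 64gb")]},
    {"rep": "acme x200 64gb", "items": [("C", "acme x200 64gb")]},
]
verify(clusters, threshold=0.5)  # 0.5 is an arbitrary illustrative value
```

In this toy input the second product of vendor A is evicted from the first cluster and migrates to the second one, whose representative has an identical title.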
2.6 Parameter Fixing
Until this point, we introduced five parameters in the presentation of UPM. Here we fix the values of these parameters based on the conclusions of exhaustive experimentation with multiple datasets. The purpose of setting fixed values to all parameters is to present an algorithm which is not only unsupervised, but also parameter-free.
We begin with , the modifier which determines the maximum number of tokens which can be used in a single combination. In all cases, the value which maximized the effectiveness of the algorithm was found to be equal to half of the average title length, that is:
$$\text{maximum tokens per combination} = \frac{\text{average title length}}{2} \qquad (7)$$
Larger or smaller values of have a negative impact on performance. This observation leads to the conclusion that, on average, only a portion of the tokens of a title are actually important for the identification of a product. This conclusion established the basis of UPM+, a simple variant which takes into consideration only the first tokens of a title, and ignores the rest of them. Therefore, the extracted combinations are reduced by a significant margin (especially in the case of long titles), whereas it is anticipated that we only suffer a small loss in matching performance. This anticipation is verified experimentally in Section 3.
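To give a feeling for the savings of title pruning, the sketch below counts the combinations that can be drawn from a full title and from a truncated one. It assumes that combinations of every size from 1 up to the maximum are extracted; the minimum size, the cutoff `keep`, and the function names are illustrative assumptions, since the exact number of retained tokens is not legible in this section.

```python
from math import comb

def num_combinations(n_tokens, k_max, k_min=1):
    """Count token combinations of size k_min..k_max drawn from n_tokens tokens."""
    upper = min(k_max, n_tokens)
    return sum(comb(n_tokens, k) for k in range(k_min, upper + 1))

def truncate_title(tokens, keep):
    """Keep only the first `keep` tokens of a title (hypothetical cutoff)."""
    return tokens[:keep]

title = "acme x100 smartphone 64gb dual sim 6.1 inch oled black eu version".split()
full = num_combinations(len(title), k_max=6)                         # 12 tokens
pruned = num_combinations(len(truncate_title(title, 8)), k_max=6)    # 8 tokens
```

For this 12-token title the count drops from 2509 to 246 combinations, which illustrates why pruning matters most for long titles.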
In addition, Eq. 5 depends on , which determines the importance of proximity in the score of a combination. Our experiments revealed that the setting maximized the effectiveness of UPM in all examined cases.
The third parameter of the algorithm is , and it was introduced in Eq. 6. The value of which consistently led to satisfactory results was .
The next parameter to determine is the field weights of Eq. 6. The simplest solution here is to assign a fixed weight value to each field; for instance, one may consider that the model fields are twice as important as the field which contains the normal tokens. Although this approach delivers good results in some cases, it has two problems: i) the weights are set arbitrarily in an ad-hoc manner, and ii) a set of predefined field weights which works well in one case, may lead to poor performance in another.
For these reasons, we dropped the idea of assigning fixed values to the field weights. Instead, we discovered a function which leads to satisfactory performance in all cases:
[TABLE]
where is the total number of the distinct tokens of the product titles, and is an array with a size equal to five. Each entry in represents the number of tokens of the field which is associated to its index. For instance, in conjunction with the first column of Table 1, stores the population of , that is, the number of tokens in the title which represent an attribute of the product. Equation 8 implements the intuition that the more tokens a field contains, the less important its tokens are, and vice versa.
Finally, we determine the value of the parameter of Algorithm 4. Recall that this parameter controls the similarity threshold of an evicted product with a candidate cluster. The value which maximized performance was .
3 Experiments
This section analyzes the results of the experimental evaluation of the proposed algorithm. In particular, we compare UPM and UPM+ with two popular string similarity metrics, i.e., cosine similarity and the Jaccard index, as well as their enhanced versions, which include IDF token weights. Given two titles and , these metrics are defined as follows:
- cosine similarity: ,
- cosine similarity with IDF token weights:
[TABLE]
- Jaccard index: , and
- Jaccard index with IDF token weights:
[TABLE]
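For orientation, the four baselines can be sketched with their common set-based formulations. These are assumptions rather than the paper's exact definitions: titles are split on whitespace, IDF is taken as log(N/df), and the weighted variants sum IDF weights over shared tokens.

```python
import math

def idf_weights(titles):
    """IDF per token, computed as log(N / document frequency) (assumed form)."""
    n = len(titles)
    df = {}
    for title in titles:
        for tok in set(title.split()):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / d) for tok, d in df.items()}

def _weight_fn(w):
    # Without weights every token counts as 1; unknown tokens get weight 0.
    return (lambda t: 1.0) if w is None else (lambda t: w.get(t, 0.0))

def cosine(a, b, w=None):
    ta, tb, wt = set(a.split()), set(b.split()), _weight_fn(w)
    dot = sum(wt(t) ** 2 for t in ta & tb)
    na = math.sqrt(sum(wt(t) ** 2 for t in ta))
    nb = math.sqrt(sum(wt(t) ** 2 for t in tb))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b, w=None):
    ta, tb, wt = set(a.split()), set(b.split()), _weight_fn(w)
    union = sum(wt(t) for t in ta | tb)
    return sum(wt(t) for t in ta & tb) / union if union else 0.0

titles = ["acme x100 black", "acme x100 white", "acme x200 black"]
w = idf_weights(titles)
plain = jaccard(titles[0], titles[1])       # shared: acme, x100
weighted = jaccard(titles[0], titles[1], w)  # "acme" has IDF 0 and stops counting
```

The weighted score is lower here because the brand token occurs in every title and therefore carries no weight, which is the effect IDF weighting is meant to achieve.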
To ensure the robustness of our evaluation and to avoid results which were accidentally obtained, we based our experiments on multiple datasets. In particular, we crawled two popular product comparison platforms, PriceRunner (https://www.pricerunner.com/) and Skroutz (https://www.skroutz.gr/), and we constructed 8 datasets out of each one. Each of these 16 datasets represents a specific product category. The categories were selected with two criteria: i) to study the performance difference of the same methods on similar products which were provided by different vendors, and ii) to examine the effectiveness of the algorithms on products from diverse categories. For this reason, we include products from both identical and different categories in our experiments. Moreover, we created one aggregate dataset per platform, which contains all the products from all 8 categories combined. These datasets enable the examination of the performance on heterogeneous datasets.
To facilitate price and feature comparison, the platforms group the same products into clusters. These clusters were utilized to establish the ground-truth for the evaluation of the various methods. More specifically, similarly to UPM, both platforms consider that all the titles within a cluster represent the same product. Hence, each dataset is accompanied by a special “matches” file, which stores all the pairs of matching titles of all clusters. This file is subsequently used to verify the effectiveness of each method.
Table 2 presents the 18 experimental datasets accompanied by several useful characteristics. The first 9 rows concern the datasets which were crawled from PriceRunner, whereas the next 9 are about the ones which were acquired from Skroutz. The columns 2, 3, and 4 display the distinct number of vendors, products, and product titles of each dataset respectively. Moreover, the fifth column shows the average length of the titles; this parameter is important because it determines the value of according to eq. 7.
Unfortunately, we could not include results from the method of Gopalakrishnan et al. (8). This algorithm submits queries to Web search engines to i) enrich the product titles with important missing words (one query per title), and ii) assign importance scores to the words of the enriched titles (one query per word pair, per title). If we applied this method on the Aggregate dataset of Skroutz (about titles and 7 words per title), the required number of queries would be 5.3 million. Clearly, this cost renders the method entirely unsustainable.
Moreover, notice that in (8), the proposed method is compared against only one similarity metric by employing only 2 small datasets. In contrast, here we evaluate UPM and UPM+ against 4 similarity metrics by using 18 datasets.
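The query cost described above can be estimated with a short calculation. This sketch assumes one enrichment query per title plus one query per unordered word pair of that title; whether the original method counts ordered or unordered pairs is not stated here, and the function name is hypothetical.

```python
from math import comb

def queries_needed(num_titles, words_per_title):
    """Estimated search-engine queries: 1 per title + 1 per unordered word pair."""
    per_title = 1 + comb(words_per_title, 2)
    return num_titles * per_title

per_title_cost = queries_needed(1, 7)  # 1 enrichment query + 21 word-pair queries
```

Under these assumptions a 7-word title already costs 22 queries, so the total grows to millions of queries for a dataset with a few hundred thousand titles.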
The experiments were conducted on a machine with an Intel Core i7 [email protected] CPU and 32GB of RAM, running Ubuntu Linux 16.04 LTS. All methods were implemented in C++ and compiled by gcc with the -O3 speed optimization flag. We have made both this code and the datasets publicly available on GitHub (https://github.com/lakritidis/UPM) to allow interested researchers to verify our results and work further on our findings.
3.1 Effectiveness Evaluation
The experimentation process is organized into two phases: In this subsection we study the effectiveness of the proposed algorithm, whereas in Subsection 3.2 we examine its efficiency. In both phases, the five parameters of the algorithm are fixed according to the discussion of Subsection 2.6.
UPM and UPM+ achieve product matching by generating clusters of similar products. To evaluate their output we applied the following methodology: Initially, we iterate through each cluster and for each product in the cluster, we create one pairwise match record with each of the rest of the products in the same cluster. In other words, we create a database with all the distinct product pairs within a cluster. In the sequel, we compare the records of this database with the ones of the aforementioned matches file and we count the number of true positives and negatives.
The matching quality was measured by employing the $F_1$ score, defined by $F_1 = 2PR/(P+R)$, where $P$ and $R$ represent the values of Precision and Recall, respectively.
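The evaluation procedure can be sketched as follows, assuming the reported score is the usual harmonic mean of precision and recall. Clusters are given as lists of title identifiers and the ground truth as a collection of matching pairs; the function names are hypothetical.

```python
from itertools import combinations

def cluster_pairs(clusters):
    """All distinct unordered pairs of members that share a cluster."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs

def evaluate(predicted_clusters, true_pairs):
    """Return (precision, recall, f1) of the predicted pairs against the ground truth."""
    predicted = cluster_pairs(predicted_clusters)
    truth = {tuple(sorted(p)) for p in true_pairs}
    tp = len(predicted & truth)  # pairs found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one of three predicted pairs is correct, one true pair is missed.
precision, recall, f1 = evaluate(
    [["t1", "t2", "t3"], ["t4"]],
    [("t1", "t2"), ("t3", "t4")],
)
```

Sorting each pair before comparison makes the check independent of the order in which the matches file lists the two titles.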
Figure 2 illustrates the performance of UPM and UPM+ against the aforementioned methods, for the 9 datasets of PriceRunner. Each diagram depicts the fluctuation of the scores for various similarity thresholds ranging from 0.1 to 0.9. Recall that the similarity threshold determines whether two entities and match or not. That is, matches only if their similarity value exceeds . Since in Subsection 2.6 we fixed , the scores of UPM and UPM+ are represented by horizontal lines.
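The threshold sweep applied to the pairwise baselines can be sketched as below. The plain Jaccard index stands in for any of the four metrics, and the helper names are hypothetical; note that every pair of titles is compared, which is the source of the quadratic cost of these methods.

```python
from itertools import combinations

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pairwise_matches(titles, similarity, threshold):
    """Index pairs whose similarity strictly exceeds the threshold."""
    return {
        (i, j)
        for i, j in combinations(range(len(titles)), 2)
        if similarity(titles[i], titles[j]) > threshold
    }

def sweep(titles, similarity, thresholds):
    """Matches obtained for each threshold value."""
    return {t: pairwise_matches(titles, similarity, t) for t in thresholds}

titles = ["acme x100 black", "acme x100 white", "zeta q7"]
thresholds = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
result = sweep(titles, jaccard, thresholds)
```

In this toy input the first two titles have a Jaccard index of 0.5, so they match for thresholds up to 0.4 and stop matching from 0.5 onwards.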
The first conclusion which derives from these diagrams is that in all datasets, the similarity metrics and with IDF token weights performed much better than their standard expressions and . For instance, in the Aggregate dataset of Fig. 2i, the effectiveness of was ; compared to the corresponding scores of and , this value was higher by 221% and 214% respectively. Similar differences are also observed for : its matching quality surpassed that of and by 219% and 211%. For this reason, we omit the plain cosine similarity and Jaccard index from the discussion which follows.
UPM prevailed over its adversary approaches in all cases. The highest values were observed in the cases of Washing Machines (0.646), Refrigerators (0.645), Microwave Ovens (0.634), and Dishwashers (0.631) in Figures 2h, 2f, 2d, and 2c, respectively. The strongest opponent was , as its scores were 0.501 (-22%), 0.45 (-30%), 0.474 (-25%), and 0.504 (-20%) for the aforementioned datasets, respectively. The largest percentage difference was measured in the case of CPUs (Fig 2a), where our method achieved an value which was about 84% greater than the respective value of . On the other hand, the smallest difference was observed in the TVs dataset, and it was roughly 17%.
Furthermore, UPM won in the large and heterogeneous Aggregate dataset (Fig. 2i), since its was 0.547, compared to the value of 0.391 which was achieved by the latter method (that is, approximately 40% higher). Apart from the CPUs dataset, the results of the Jaccard index were slightly worse than those of ; consequently, UPM outperformed this metric by an even greater margin.
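The percentages quoted in this discussion use two different reference points, which the following small helper makes explicit. The negative figures express the opponent's score relative to UPM, whereas the "higher by" figures express UPM's score relative to the opponent. The function name is hypothetical; the scores are the ones reported above.

```python
def relative_change(value, reference):
    """Percentage difference of `value` with respect to `reference`."""
    return 100.0 * (value - reference) / reference

# Opponent relative to UPM (Washing Machines): 0.501 vs 0.646
opponent_vs_upm = relative_change(0.501, 0.646)

# UPM relative to the opponent (Aggregate dataset): 0.547 vs 0.391
upm_vs_opponent = relative_change(0.547, 0.391)
```

The first value rounds to -22% and the second to 40%, matching the figures given in the text.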
Regarding UPM+, in most cases, its effectiveness was very close to the one of UPM. Recall that UPM+ attempts to improve the execution time of the algorithm by processing only the first tokens of a product title, and by pruning the rest of them. For the datasets which contained Mobile Phones (Fig. 2e) and TVs (Fig. 2g), the two algorithms performed almost equally well. Additionally, for the Aggregate dataset, UPM is only 3.6% more accurate than UPM+. These measurements verify the theoretical foundation which predicted that only a portion of the words of the titles are important for the identification of a product.
The results indicate the superiority of UPM and UPM+ against their adversary methods in multiple types of products, and in the heterogeneous Aggregate dataset. The situation is improved even further in the datasets which originate from Skroutz (Fig. 3). In two cases, namely CPUs (Fig. 3d), and Refrigerators (Fig. 3f), UPM approached 100% precision, with being equal to 0.932 and 0.851 respectively. In the same datasets, the effectiveness of was 0.651 (-30%) and 0.568 (-33%) respectively, whereas achieved 0.606 (-35%) for CPUs, and 0.546 (-36%) for Refrigerators. Moreover, in the Aggregate dataset, UPM outperformed and by nearly 71% and 104% respectively.
Unlike the previous case, the performance of UPM+ was not so stable compared to UPM. In some datasets the two methods achieve product matching of almost equal quality, such as Air Conditioners (Fig. 3a) and Digital Cameras (Fig. 3e). However, there are occasions where the difference is larger, like the cases of Watches (Fig. 3h) and Aggregate datasets (Fig. 3i). Here UPM+ is inferior to UPM, by 53% and 26% respectively. Watches is the only case where UPM+ is defeated by cosine similarity, even marginally, by 6%.
Apart from their superior effectiveness, the proposed algorithms are also parameter-free, whereas the performance of the pairwise matching methods depends heavily on the selected similarity threshold value. In particular, the maximum effectiveness of was observed for four different values of : (e.g. CPUs in Fig. 3d), (e.g. Car Batteries in Fig. 3b), (e.g. Air Conditioners in Fig. 3a), and (e.g. Watches in Fig. 3h). A similar observation is also valid for the Jaccard index.
Four datasets, that is, CPUs, Digital Cameras, Refrigerators, and TVs have been crawled from both product comparison platforms. The examination of all methods on these datasets leads to the conclusion that the effectiveness does not primarily depend on the category itself. Instead, it is rather affected by how accurately the vendors describe their products. For instance, the score of UPM for the CPUs of PriceRunner and Skroutz was 0.579 and 0.932 respectively, a difference of about 61%. On the contrary, this difference was only 6% for the Digital Cameras. Moreover, UPM performed better on the Refrigerators rather than the CPUs of PriceRunner, whereas the opposite occurred on the corresponding datasets of Skroutz.
3.2 Efficiency Evaluation
This subsection contains the experimental measurements of the efficiency of the proposed algorithm in comparison with the aforementioned pairwise matching methods. In summary, the reported results demonstrate that in contrast to the other methods, both UPM and UPM+ are fast enough to be applied to all datasets, even to the larger ones.
Figure 4 depicts the running times (in seconds) of the six unsupervised product matching methods which participate in our evaluation. More specifically, the two diagrams illustrate the duration of the execution of these methods in the 9 datasets of PriceRunner (top diagram), and the 9 datasets of Skroutz (bottom diagram). The vertical axis of time is in logarithmic scale, to reliably display the large time differences between these executions.
Regarding the PriceRunner datasets, UPM+ was the fastest method among its adversaries, whereas the basic method, UPM, was ranked second. Notice that the larger the value of is, the greater the performance gap becomes. This is anticipated, since a high value leads to a big number of combinations to be extracted and scored. For instance, the average title length of Dishwashers was 7.591 (Table 2), therefore, we set according to eq. 7. For such small values of , UPM and UPM+ were equally fast (0.8 sec). On the other hand, for CPUs, where was equal to 5, UPM+ was more than 3 times faster than UPM (2.4 vs 8 sec). Similarly, for TVs where was also 5, UPM+ was 6.2 times faster than UPM (1.3 vs 8.1 sec).
Both UPM and UPM+ were substantially faster than the plain string similarity metrics. Notice that the larger the dataset is, the higher the performance gap becomes, due to the quadratic complexity of the pairwise matching procedure. The slowest methods were the ones which were the strongest opponents in terms of matching quality, that is, and . For instance, in the CPUs dataset UPM was 2.2 and 2.8 times faster than and respectively, whereas, UPM+ outperformed these metrics by 7.2 and 9.5 times. We will shortly discuss the Aggregate dataset.
The efficiency measurements were also positive for our proposed algorithms in the datasets which originated from Skroutz. For instance, in the case of Watches, UPM and UPM+ required equal time, and they were faster than , , and by 6.5, 18.1, 10.8, and 20.5 times respectively. Remarkably, in some datasets such as TVs, Refrigerators, and Cookers & Ovens, our algorithms were faster than the pairwise methods by roughly two orders of magnitude.
Finally, Table 3 presents the execution times and the efficiency differences of the six examined methods on the two Aggregate datasets from PriceRunner and Skroutz. According to Table 2 and eq. 7, the first dataset was processed with , whereas the second with . Consequently, UPM+ achieved better times in the first case and it was faster than , , and by roughly 11.6, 28.8, 20.6 and 37.5 times respectively. The corresponding performance gaps in the Aggregate dataset of Skroutz were also very high, approximating 8.6, 24.5, 14.2 and 28.4 times respectively.
4 Conclusions
In this paper we introduced UPM, a clustering-based unsupervised algorithm for matching product titles from different data sources. This problem is particularly important for the e-commerce industry since it facilitates the comparison of product features and prices. UPM implements multiple novel elements, the most important of which are:
- it does not perform pairwise comparison of the titles and thus avoids the quadratic complexity of this procedure. Instead, it achieves matching by grouping the titles of identical products into clusters,
- it partially identifies the semantics of the title words,
- it includes a post-processing verification stage which corrects the erroneous matchings by moving products between clusters and by creating new clusters.
In addition, we introduced UPM+, a variant which prunes the titles and processes only a portion of their words.
The exhaustive experimental evaluation of UPM and UPM+ on 18 datasets from two product comparison platforms demonstrated their superiority over the traditional pairwise matching methods. More specifically in terms of matching quality, our method outperformed 4 similarity metrics by a margin of up to 84%. Furthermore, it was about 24–37 times faster than the pairwise matching methods in large datasets. In some cases, the performance was improved by more than two orders of magnitude.
References
- (1) Akritidis, L., Bozanis, P.: Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations. In: Proceedings of the 14th IEEE International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1–10 (2018)
- (2) Bär, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures. In: Proceedings of the 1st Joint Conference on Lexical and Computational Semantics, pp. 435–440 (2012)
- (3) Bilenko, M., Mooney, R.J.: Adaptive Duplicate Detection using Learnable String Similarity Measures. In: Proceedings of ACM SIGKDD, pp. 39–48 (2003)
- (4) Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and Efficient Fuzzy Match for Online Data Cleaning. In: Proceedings of ACM SIGMOD, pp. 313–324 (2003)
- (5) Christen, P.: FEBRL: a Freely Available Record Linkage System with a Graphical User Interface. In: the 2nd Australasian Workshop on Health Data and Knowledge Management, pp. 17–25 (2008)
- (6) Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: a Survey. IEEE Transactions on Knowledge and Data Engineering 19 (1), 1–16 (2007)
- (7) Gomaa, W.H., Fahmy, A.A.: A Survey of Text Similarity Approaches. International Journal of Computer Applications 68 (13) (2013)
- (8) Gopalakrishnan, V., Iyengar, S.P., Madaan, A., Rastogi, R., Sengamedu, S.: Matching Product Titles using Web-based Enrichment. In: Proceedings of ACM CIKM, pp. 605–614 (2012)
