Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data

Alexander Flick; Sebastian Jäger; Ivana Trajanovska; Felix Biessmann

arXiv:2302.12139 · cs.IR · February 24, 2023

TL;DR

This paper presents a machine learning approach that leverages a multilingual dataset to extract detailed product information from unstructured web data, improving cross-shop and cross-language retrieval and taxonomy matching.

Contribution

It introduces a novel transfer learning method for extracting fine-grained product attributes from multilingual, unstructured web data, enabling better cross-shop and multilingual product matching.

Findings

1. Models reliably predict product attributes across shops and languages.

2. The approach improves product taxonomy matching accuracy.

3. A multilingual dataset enhances transfer learning capabilities.

Abstract

Extracting structured information from unstructured data is one of the key challenges in modern information retrieval applications, including e-commerce. Here, we demonstrate how recent advances in machine learning, combined with a recently published multilingual data set with standardized fine-grained product category information, enable robust product attribute extraction in challenging transfer learning settings. Our models can reliably predict product attributes across online shops, languages, or both. Furthermore, we show that our models can be used to match product taxonomies between online retailers.

Tables

Table 1: Comparison of e-commerce data sets used for product attribute extraction and classification. The GPC column indicates whether the data set follows the GPC taxonomy.

Data set	regularly updated	multi-lingual	multi-shop	multi-family	GPC	size
Farfetch product meta data [9]	✗	✗	✗	✗	✗	400K
Product details on Flipkart [3]	✗	✗	✗	✓	✗	20K
Amazon browse node classification [2]	✗	✗	✗	✓	✗	3M
Amazon product-question answering [16]	✗	✗	✗	✓	✗	17.3GB
Rakuten data challenge [10]	✗	✗	✗	✓	✗	1M
MAVE [18]	✗	✗	✗	✓	✗	2.2M
Innerwear from Victoria’s Secret & Co [15]	✗	✗	✓	✗	✗	600K
WDC-MWPD [19]	✗	✗	✓	✗	✓	16K
WDC-25 gold standard [14]	✗	✗	✓	✓	✓	24K
GreenDB [7]	✓	✓	✓	✓	✓	> 576K

Table 2: Weighted F1 scores for shop transfer experiments. Scores from 0.648 to 0.836 demonstrate robust shop transfer. Shop transfer profits from additional data in other languages.

Setting	Model	Asos (FR)	H&M (FR)	Otto (DE)	Amazon (DE)
Shop Transfer	$model_{ZaFR}$	0.836	0.678	-	-
Shop Transfer	$model_{ZaDE}$	-	-	0.777	0.648
Shop Transfer	$model_{ZaALL}$	0.842	0.717	0.762	0.739
Shop & Language Transfer	$model_{ZaFR}$	-	-	0.614	0.449
Shop & Language Transfer	$model_{ZaDE}$	0.795	0.666	-	-

Taxonomy

Topics: Text and Document Classification Technologies · Web Data Mining and Analysis · Sentiment Analysis and Opinion Mining

Full text

Affiliations: (1) Berlin University of Applied Sciences and Technology; (2) Einstein Center Digital Future, Berlin, Germany

Email: [email protected]

Alexander Flick (1), ORCID 0000-0001-5273-0679

Sebastian Jäger (1), ORCID 0000-0001-9420-8571

Ivana Trajanovska (1), ORCID 0000-0003-0374-1210

Felix Biessmann (1, 2), ORCID 0000-0002-3422-1026

Keywords: product information extraction · e-commerce

1 Introduction

Recent research achievements in the field of machine learning (ML) [1, 13] have the potential to improve automated information extraction in applications such as e-commerce. However, the translation of these ML innovations into real-world application scenarios is impeded by the lack of publicly available data sets. Here we demonstrate that recent advances in ML can be translated into automated information extraction applications when leveraging carefully curated data. To better assess the contribution of this study, we first highlight some relevant data sets and methods that aim at the automated extraction of structured data in the field of e-commerce.

Public E-commerce Data Sets

We summarize publicly available e-commerce data sets used for the automated extraction of product information in Table 1. To leverage the potential of ML, large and diverse data sets that follow a fine-grained product taxonomy are favorable. A common and detailed taxonomy is the Global Product Classification (GPC) standard, which “classifies products by grouping them into categories based on their essential properties as well as their relationships to other products” [4]. For example, multiple Bricks (shirts and shorts) can belong to the same Family (clothing) but to different Classes (upper and lower body wear). See the GPC Browser for more examples: https://gpc-browser.gs1.org/.

Multilingual Fine-Grained Product Classification

There are few recent studies investigating automated extraction of standardized product information in text corpora. Brinkmann et al. [1] study how hierarchical product classification benefits from domain-specific language modeling. They report an improvement of 0.012 weighted F1 score by using schema.org Product annotations (https://schema.org/Product) for pre-training. Peeters et al. [12] study cross-language learning for entity matching and demonstrate that multilingual transformers outperform single-language models (German BERT) by 0.143 F1 when trained on a single language (German) and tested on multiple (German and English). Furthermore, using additional training data for the second language (English) improves the performance by another 0.038 weighted F1.

These studies highlight the potential of modern ML methods for automated product attribute extraction. In this work, we show that transfer learning helps to extract structured information (product category) from unstructured data (product name and description) and to find reliable taxonomy mappings.

2 Experiments

We evaluate three transfer learning scenarios for product classification:

1. Language Transfer: training on data of one language, testing on data of another language

2. Shop Transfer: training on data of one shop, testing on data of another shop

3. Language and Shop Transfer: training on data of one shop and one language, testing on data of different shops and languages
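The three scenarios differ only in which properties change between training and test data. A minimal sketch of that distinction, assuming each data split is described by a single shop and a single language (the `Domain` type and the lowercase shop and language codes are illustrative, not taken from the paper's code):

```python
from typing import NamedTuple


class Domain(NamedTuple):
    """Hypothetical description of where a data split comes from."""
    shop: str
    language: str


def transfer_scenario(train: Domain, test: Domain) -> str:
    """Name the transfer scenario implied by a train/test pair."""
    shop_changes = train.shop != test.shop
    language_changes = train.language != test.language
    if shop_changes and language_changes:
        return "language and shop transfer"
    if shop_changes:
        return "shop transfer"
    if language_changes:
        return "language transfer"
    return "no transfer"


zalando_de = Domain(shop="zalando", language="de")
print(transfer_scenario(zalando_de, Domain("zalando", "fr")))  # language transfer
print(transfer_scenario(zalando_de, Domain("otto", "de")))     # shop transfer
print(transfer_scenario(zalando_de, Domain("asos", "fr")))     # language and shop transfer
```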

Furthermore, we study whether ML methods can be used to find reliable taxonomy mappings. For this, we apply a model trained for a target taxonomy to data that uses a source taxonomy. For each source category, the most frequently predicted target category defines the mapping from source to target taxonomy.
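The majority-vote mapping can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the function name, the toy category names, and the tie-breaking rule (the first target category encountered wins) are all made up for the example.

```python
from collections import Counter, defaultdict


def majority_mapping(source_labels, predicted_targets):
    """Map each source category to its most frequently predicted target category."""
    votes = defaultdict(Counter)
    for source, target in zip(source_labels, predicted_targets):
        votes[source][target] += 1
    # most_common(1) returns the top (target, count) pair; ties keep insertion order.
    return {source: counts.most_common(1)[0][0] for source, counts in votes.items()}


# Toy data: source categories of five products and the target categories
# a model predicted for them (all names are hypothetical).
source = ["Sneakers", "Sneakers", "Sneakers", "Tops", "Tops"]
predicted = ["SHOES", "SHOES", "SHIRT", "SHIRT", "SHIRT"]
print(majority_mapping(source, predicted))
# {'Sneakers': 'SHOES', 'Tops': 'SHIRT'}
```

A single misprediction ("SHIRT" for one of the sneakers) does not change the mapping, which is why a classifier with imperfect per-product accuracy can still yield a reliable category-level mapping.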

Data Sets

In our experiments, we use two data sets, the GreenDB [6] and the Farfetch data set [9]. The GreenDB (we use version 0.2.2, available at https://zenodo.org/record/7225336) is a multilingual data set covering 5 European shops with about 576k unique products of the 37 most important product categories following the GPC taxonomy. It covers categories from the GPC segments Clothing, Footwear, Personal Accessories, Home Appliances, Audio Visual/Photography, and Computing. A recent publication [8] presents the GreenDB’s high quality and usefulness for information extraction tasks. The Farfetch data set has about 400k unique products from a single shop. It does not follow a public taxonomy and covers only fashion products.

ML Model

The experiment implementation is based on AutoGluon’s [17] TextPredictor and uses mDeBERTaV3 [5] as the backbone model. For training, we use the GreenDB and apply Cleanlab [11] to find and remove misclassified products (211 were found). Our models use a product’s name and description to predict its product category. $model_{baseline}$ is trained on the entire GreenDB (all shops), $model_{ZaDE}$ on the German, $model_{ZaFR}$ on the French, and $model_{ZaALL}$ on the German, French, and English Zalando products contained in the GreenDB.
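The four models differ only in the subset of the GreenDB they are trained on. A minimal sketch of that selection, assuming each product is a dictionary with hypothetical `shop` and `language` fields and lowercase codes (the actual GreenDB schema may differ):

```python
# Training subsets per model, following the description above.
# Field names ("shop", "language") and the lowercase codes are assumptions.
ZALANDO = "zalando"

TRAINING_FILTERS = {
    "model_baseline": lambda product: True,  # entire GreenDB, all shops
    "model_ZaDE": lambda product: product["shop"] == ZALANDO and product["language"] == "de",
    "model_ZaFR": lambda product: product["shop"] == ZALANDO and product["language"] == "fr",
    "model_ZaALL": lambda product: product["shop"] == ZALANDO
    and product["language"] in {"de", "fr", "en"},
}


def training_subset(products, model_name):
    """Return the products a given model would be trained on."""
    keep = TRAINING_FILTERS[model_name]
    return [product for product in products if keep(product)]


# Toy records standing in for GreenDB rows.
products = [
    {"shop": "zalando", "language": "de", "name": "Hemd"},
    {"shop": "zalando", "language": "fr", "name": "Chemise"},
    {"shop": "zalando", "language": "en", "name": "Shirt"},
    {"shop": "otto", "language": "de", "name": "Laptop"},
]
for model_name in TRAINING_FILTERS:
    print(model_name, len(training_subset(products, model_name)))
```

With these toy records, the baseline sees all four products, the two single-language Zalando models see one each, and the multilingual Zalando model sees three.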

Online Demo

To demonstrate the transfer capabilities, we published an online demo, available at https://product-classification.demo.calgo-lab.de. As shown in Figure 1, it automatically downloads the HTML of a given URL, extracts the product’s name and description, and uses $model_{baseline}$ to predict its GPC category.
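The paper does not describe how the demo extracts the name and description from the HTML. One plausible approach, shown here purely as an assumption, reads the page's `<title>` element and its `description` meta tag using Python's standard library; the class and function names are hypothetical, and the example works on a static string instead of downloading a URL.

```python
from html.parser import HTMLParser


class ProductTextParser(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.name = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.name += data.strip()


def extract_product_text(html):
    """Return the two text fields a classifier would receive as input."""
    parser = ProductTextParser()
    parser.feed(html)
    return {"name": parser.name, "description": parser.description}


page = """<html><head><title>Organic cotton T-shirt</title>
<meta name="description" content="Short-sleeved shirt made of organic cotton.">
</head><body></body></html>"""
print(extract_product_text(page))
```

The resulting dictionary would then be passed to the trained classifier; real shop pages often need shop-specific extraction rules, which this sketch does not attempt.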

3 Results

The baseline model ($model_{baseline}$) achieves a strong weighted F1 score of 0.99 on a GreenDB test set.
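All scores in this section are weighted F1 scores: the per-class F1 scores averaged with each class weighted by its share of the true labels. The following sketch computes the metric from scratch on toy labels; it follows the standard definition and is not taken from the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_positives = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        n_predicted = sum(1 for p in y_pred if p == label)
        precision = true_positives / n_predicted if n_predicted else 0.0
        recall = true_positives / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score


# Toy example with hypothetical category names.
y_true = ["shirt", "shirt", "shirt", "shoes"]
y_pred = ["shirt", "shirt", "shoes", "shoes"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.767
```

Here the "shirt" class has F1 = 0.8 with weight 3/4 and the "shoes" class has F1 = 2/3 with weight 1/4, giving 0.6 + 0.167 = 0.767. Because of the weighting, frequent categories dominate the score.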

Transfer Tasks

$model_{ZaDE}$ demonstrates language transfer when it is applied to other languages of the same shop: it achieves weighted F1 scores of 0.898 for English and 0.873 for French. Applying $model_{ZaFR}$ and $model_{ZaDE}$ to other shops demonstrates shop transfer, with weighted F1 scores from 0.648 to 0.836. If the model is fine-tuned on multilingual data ($model_{ZaALL}$), almost all shops benefit (all except Otto; see Table 2 for details). Combined language and shop transfer is even more challenging and performs worse for all shops. Transferring across data sets, i.e., applying $model_{baseline}$ to Farfetch data, achieves a weighted F1 score of 0.924.

Taxonomy Matching

Using $model_{baseline}$ to map product categories from Farfetch to GreenDB (GPC taxonomy) results in 41 out of 46 (> 89%) correctly mapped categories.

4 Conclusion

We demonstrate that combining rich multilingual data sets and modern ML methods enables fine-grained standardized product information extraction from unstructured data. We investigate several transfer learning settings when training and testing on data from different shops and languages, even in zero-shot scenarios when no data from another shop and language was available in the training data.

Acknowledgements. This research was supported by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety based on a decision of the German Bundestag.

Bibliography

[1] Brinkmann, A., Bizer, C.: Improving Hierarchical Product Classification using Domain-specific Language Modelling. IEEE Data Eng. Bull. 44(2), 14–25 (2021), http://sites.computer.org/debull/A21june/p14.pdf
[2] Challenge, A.M.: (2022), https://www.hackerearth.com/en-us/challenges/competitive/amazon-ml-challenge/, [Online; accessed 23-May-2022]
[3] Flipkart: (2022), https://www.kaggle.com/PromptCloudHQ/flipkart-products, [Online; accessed 23-May-2022]
[4] GS1: Global Product Classification (GPC) | GS1. https://www.gs1.org/standards/gpc, [Online; accessed October 20, 2022]
[5] He, P., Gao, J., Chen, W.: DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. CoRR abs/2111.09543 (2021). https://doi.org/10.48550/arxiv.2111.09543
[6] Jäger, S., Greene, J., Jakob, M., Korenke, R., Santarius, T., Biessmann, F.: GreenDB: Toward a Product-by-Product Sustainability Database. Tech. rep., arXiv (May 2022). https://doi.org/10.48550/arXiv.2205.02908
[7] Jäger, S., Bießmann, F., Flick, A., Sanchez Garcia, J.A., von den Driesch, K., Brendel, K.: GreenDB: A Product-by-Product Sustainability Database (Feb 2022). https://doi.org/10.5281/zenodo.6576662, Supported by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety based on a decision of the German Bundestag. Förderkennzeichen: 67KI2022B
[8] Jäger, S., Flick, A., Garcia, J.A.S., Driesch, K.v.d., Brendel, K., Biessmann, F.: GreenDB - A Dataset and Benchmark for Extraction of Sustainability Information of Consumer Goods (Aug 2022). https://doi.org/10.48550/arXiv.2207.10733