Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data
Alexander Flick, Sebastian Jäger, Ivana Trajanovska, Felix Biessmann

TL;DR
This paper presents a machine learning approach that leverages a multilingual dataset to extract detailed product information from unstructured web data, improving cross-shop and cross-language retrieval and taxonomy matching.
Contribution
It introduces a novel transfer learning method for extracting fine-grained product attributes from multilingual, unstructured web data, enabling better cross-shop and multilingual product matching.
Findings
Models reliably predict product attributes across shops and languages.
The approach improves product taxonomy matching accuracy.
Multilingual dataset enhances transfer learning capabilities.
Abstract
Extracting structured information from unstructured data is one of the key challenges in modern information retrieval applications, including e-commerce. Here, we demonstrate how recent advances in machine learning, combined with a recently published multilingual data set with standardized fine-grained product category information, enable robust product attribute extraction in challenging transfer learning settings. Our models can reliably predict product attributes across online shops, languages, or both. Furthermore, we show that our models can be used to match product taxonomies between online retailers.
Table 1: Public e-commerce data sets for the automated extraction of product information.

| Data set | regularly updated | multi-lingual | multi-shop | multi-family | GPC | size |
|---|---|---|---|---|---|---|
| Farfetch product meta data [9] | ✗ | ✗ | ✗ | ✗ | ✗ | 400K |
| Product details on Flipkart [3] | ✗ | ✗ | ✗ | ✓ | ✗ | 20K |
| Amazon browse node classification [2] | ✗ | ✗ | ✗ | ✓ | ✗ | 3M |
| Amazon product-question answering [16] | ✗ | ✗ | ✗ | ✓ | ✗ | 17.3GB |
| Rakuten data challenge [10] | ✗ | ✗ | ✗ | ✓ | ✗ | 1M |
| MAVE [18] | ✗ | ✗ | ✗ | ✓ | ✗ | 2.2M |
| Innerwear from Victoria's Secret & co [15] | ✗ | ✗ | ✓ | ✗ | ✗ | 600K |
| WDC-MWPD [19] | ✗ | ✗ | ✓ | ✗ | ✓ | 16K |
| WDC-25 gold standard [14] | ✗ | ✗ | ✓ | ✓ | ✓ | 24K |
| GreenDB [7] | ✓ | ✓ | ✓ | ✓ | ✓ | 576K |
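Table 2: Weighted F1 scores for the shop transfer and the shop and language transfer tasks. Models are identified by the language(s) of the Zalando products they were trained on.

| Task | Model (training language) | Asos (FR) | H&M (FR) | Otto (DE) | Amazon (DE) |
|---|---|---|---|---|---|
| Shop Transfer | French | 0.836 | 0.678 | - | - |
| Shop Transfer | German | - | - | 0.777 | 0.648 |
| Shop Transfer | German, French, and English | 0.842 | 0.717 | 0.762 | 0.739 |
| Shop & Language Transfer | French | - | - | 0.614 | 0.449 |
| Shop & Language Transfer | German | 0.795 | 0.666 | - | - |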
Alexander Flick (1), ORCID 0000-0001-5273-0679
Sebastian Jäger (1), ORCID 0000-0001-9420-8571
Ivana Trajanovska (1), ORCID 0000-0003-0374-1210
Felix Biessmann (1, 2), ORCID 0000-0002-3422-1026
(1) Berlin University of Applied Sciences and Technology
(2) Einstein Center Digital Future, Berlin, Germany
Email: [email protected]
Keywords: product information extraction · e-commerce
1 Introduction
Recent research achievements in the field of machine learning (ML) [1, 13] have the potential to improve automated information extraction in applications such as e-commerce. However, the translation of these ML innovations into real-world application scenarios is impeded by the lack of publicly available data sets. Here we demonstrate that recent advances in ML can be translated into automated information extraction applications when leveraging carefully curated data. To better assess the contribution of this study, we first highlight some relevant data sets and methods that aim at the automated extraction of structured data in the field of e-commerce.
Public E-commerce Data Sets
We summarize publicly available e-commerce data sets used for the automated extraction of product information in Table 1. To leverage the potential of ML, large and diverse data sets that follow a fine-grained product taxonomy are favorable. A common and detailed taxonomy is the Global Product Classification (GPC) standard, which "classifies products by grouping them into categories based on their essential properties as well as their relationships to other products" [4]. For example, multiple Bricks (shirts and shorts) can belong to the same Family (clothing) but to different Classes (upper and lower body wear). See the GPC Browser for more examples: https://gpc-browser.gs1.org/.
Multilingual Fine-Grained Product Classification
Few recent studies investigate the automated extraction of standardized product information from text corpora. Brinkmann et al. [1] study how hierarchical product classification benefits from domain-specific language modeling. They report an improvement of 0.012 weighted F1 score by using schema.org product annotations (https://schema.org/Product) for pre-training. Peeters et al. [12] study cross-language learning for entity matching and demonstrate that multilingual transformers outperform single-language models (German BERT) by 0.143 F1 when trained on a single language (German) and tested on multiple languages (German and English). Furthermore, using additional training data for the second language (English) improves the performance by another 0.038 weighted F1.
These studies highlight the potential of modern ML methods for automated product attribute extraction. In this work, we show that transfer learning helps to extract structured information (product category) from unstructured data (product name and description) and to find reliable taxonomy mappings.
2 Experiments
We evaluate three transfer learning scenarios for product classification:
1. Language Transfer: training on data of one language, testing on data of another language
2. Shop Transfer: training on data of one shop, testing on data of another shop
3. Language and Shop Transfer: training on data of one shop and one language, testing on data of different shops and languages
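The three scenarios can be read as three ways of selecting test data relative to a fixed training shop and language. The following is a minimal sketch under stated assumptions, not the authors' code: the `Product` record, its field names, and the `transfer_split` helper are hypothetical, and the choice to keep the shop fixed for language transfer (and the language fixed for shop transfer) is an interpretation of the scenario definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical record layout; the field names are assumptions.
    shop: str
    language: str
    text: str      # product name and description
    category: str  # product category label


def transfer_split(products, train_shop, train_language, scenario):
    """Return (train, test) lists for one of the three transfer scenarios."""
    train = [
        p for p in products
        if p.shop == train_shop and p.language == train_language
    ]
    if scenario == "language":
        # Same shop, different language.
        test = [p for p in products
                if p.shop == train_shop and p.language != train_language]
    elif scenario == "shop":
        # Different shop, same language.
        test = [p for p in products
                if p.shop != train_shop and p.language == train_language]
    elif scenario == "language_and_shop":
        # Different shop and different language.
        test = [p for p in products
                if p.shop != train_shop and p.language != train_language]
    else:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return train, test
```

Because the training and test filters are mutually exclusive on shop, language, or both, no product can appear on both sides of a split.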
Furthermore, we study whether ML methods can be used to find reliable taxonomy mappings. For this, we apply a model trained for a target taxonomy to data that uses a source taxonomy. For each source category, the majority of predicted target categories define the mapping from source to target taxonomy.
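The majority-vote rule described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name and the example labels are hypothetical, and the tie-breaking behavior (the first-seen target category wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(source_categories, predicted_target_categories):
    """Map each source category to its most frequently predicted target category.

    Both arguments are equally long sequences: element i holds the source
    taxonomy label of product i and the target label predicted for it.
    """
    if len(source_categories) != len(predicted_target_categories):
        raise ValueError("inputs must have the same length")
    votes = defaultdict(Counter)
    for source, target in zip(source_categories, predicted_target_categories):
        votes[source][target] += 1
    # most_common(1) returns the target with the highest count; on ties,
    # the target that was counted first is returned.
    return {source: counter.most_common(1)[0][0]
            for source, counter in votes.items()}


# Hypothetical labels, for illustration only.
sources = ["Trainers", "Trainers", "Trainers", "Tees", "Tees"]
predictions = ["SHOES", "SHOES", "SOCKS", "SHIRT", "SHIRT"]
mapping = majority_vote_mapping(sources, predictions)
```

A mapping built this way can then be compared against a manually curated mapping to count how many source categories were mapped correctly.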
Data Sets
In our experiments, we use two data sets, the GreenDB [6] and the Farfetch data set [9]. The GreenDB (we use version 0.2.2, available at https://zenodo.org/record/7225336) is a multilingual data set covering 5 European shops with about 576k unique products of the 37 most important product categories following the GPC taxonomy. It covers categories from the GPC segments Clothing, Footwear, Personal Accessories, Home Appliances, Audio Visual/Photography, and Computing. A recent publication [8] demonstrates the GreenDB's high quality and usefulness for information extraction tasks. The Farfetch data set has about 400k unique products from a single shop. It does not follow a public taxonomy and covers only fashion products.
ML Model
The experiment implementation is based on AutoGluon's [17] TextPredictor and uses mDeBERTaV3 [5] as the backbone model. For training, we use the GreenDB and apply Cleanlab [11] to find and remove misclassified products (211 were found). Our models use a product's name and description to predict its product category. We train four models: one on the entire GreenDB (all shops), one on the German, one on the French, and one on the German, French, and English Zalando products contained in the GreenDB.
Online Demo
To demonstrate the transfer capabilities, we published an online demo, available at https://product-classification.demo.calgo-lab.de. As shown in Figure 1, it automatically downloads the HTML of a given URL, extracts the product's name and description, and uses a trained model to predict its GPC category.
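The extraction step of such a pipeline can be illustrated with the Python standard library. This is a sketch under stated assumptions and not the demo's actual logic, which the text does not describe: here the product name is assumed to be in the page's `<title>` element and the description in a `<meta name="description">` tag, and the class and function names are hypothetical. Downloading the page and calling the classifier are left out.

```python
from html.parser import HTMLParser


class ProductTextParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.name_parts = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self.name_parts.append(data)


def extract_name_and_description(html):
    """Return (name, description) extracted from an HTML string."""
    parser = ProductTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace in the title.
    name = " ".join("".join(parser.name_parts).split())
    return name, parser.description.strip()
```

The two returned strings correspond to the inputs the models use (name and description); real shop pages often need shop-specific extraction rules or structured annotations instead of these two tags.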
3 Results
The baseline model achieves a strong weighted F1 score of 0.99 on a GreenDB test set.
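For readers unfamiliar with the metric: the weighted F1 score averages the per-category F1 scores, weighting each category by its number of true instances. The sketch below follows that common definition in plain Python; it is an assumption that the reported scores use exactly this definition, since the text does not name the implementation, and the function name is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)        # true instances per class
    predicted = Counter(y_pred)      # predicted instances per class
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = true_positives[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Classes that never occur in y_true have zero support and so
        # contribute nothing to the weighted average.
        score += (n_true / total) * f1
    return score
```

Because large categories dominate the average, a high weighted F1 score can coexist with weak performance on rare categories.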
Transfer Tasks
The model trained on German Zalando products demonstrates language transfer when it is applied to other languages of the same shop. It achieves weighted F1 scores of 0.898 for English and 0.873 for French. Applying the German and the French model to other shops demonstrates shop transfer with weighted F1 scores from 0.648 to 0.836. If the model is fine-tuned on multilingual data (German, French, and English Zalando products), almost all shops benefit; see Table 2 for details. The combined language and shop transfer is even more challenging and performs worse for all shops. Transferring across data sets, i.e., applying a model trained on the GreenDB to Farfetch data, achieves a 0.924 weighted F1 score.
Taxonomy Matching
Using a model trained on the GreenDB to map product categories from Farfetch to GreenDB (GPC taxonomy) results in 41 out of 46 (89%) correctly mapped categories.
4 Conclusion
We demonstrate that combining rich multilingual data sets and modern ML methods enables fine-grained standardized product information extraction from unstructured data. We investigate several transfer learning settings in which models are trained and tested on data from different shops and languages, including zero-shot scenarios in which no data from the target shop and language was available in the training data.
Acknowledgements This research was supported by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety based on a decision of the German Bundestag.
References
- [1] Brinkmann, A., Bizer, C.: Improving Hierarchical Product Classification using Domain-specific Language Modelling. IEEE Data Eng. Bull. 44(2), 14–25 (2021), http://sites.computer.org/debull/A21june/p14.pdf
- [2] Challenge, A.M.: (2022), https://www.hackerearth.com/en-us/challenges/competitive/amazon-ml-challenge/, [Online; accessed 23-May-2022]
- [3] Flipkart: (2022), https://www.kaggle.com/PromptCloudHQ/flipkart-products, [Online; accessed 23-May-2022]
- [4] GS1: Global Product Classification (GPC) | GS1. https://www.gs1.org/standards/gpc, [Online; accessed October 20, 2022]
- [5] He, P., Gao, J., Chen, W.: DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. CoRR abs/2111.09543 (2021). https://doi.org/10.48550/arxiv.2111.09543
- [6] Jäger, S., Greene, J., Jakob, M., Korenke, R., Santarius, T., Biessmann, F.: GreenDB: Toward a Product-by-Product Sustainability Database. Tech. rep., arXiv (May 2022). https://doi.org/10.48550/arXiv.2205.02908
- [7] Jäger, S., Bießmann, F., Flick, A., Sanchez Garcia, J.A., von den Driesch, K., Brendel, K.: GreenDB: A Product-by-Product Sustainability Database (Feb 2022). https://doi.org/10.5281/zenodo.6576662, Supported by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety based on a decision of the German Bundestag. Förderkennzeichen: 67KI2022B
- [8] Jäger, S., Flick, A., Garcia, J.A.S., Driesch, K.v.d., Brendel, K., Biessmann, F.: GreenDB - A Dataset and Benchmark for Extraction of Sustainability Information of Consumer Goods (Aug 2022). https://doi.org/10.48550/arXiv.2207.10733
