# Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data

**Authors:** Alexander Flick, Sebastian Jäger, Ivana Trajanovska, Felix Biessmann

arXiv: 2302.12139 · 2023-02-24

## TL;DR

This paper shows that machine learning models trained on a recently published multilingual data set with standardized fine-grained product categories can extract product attributes from unstructured web data across online shops and languages, and can be used to match product taxonomies between retailers.

## Contribution

The paper combines recent advances in machine learning with a multilingual data set of standardized fine-grained product categories to extract product attributes in challenging transfer learning settings: across shops, across languages, or both.

## Key findings

- Models reliably predict product attributes across shops and languages.
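- The same models can be used to match product taxonomies between online retailers.
- A multilingual data set with standardized fine-grained product categories enables robust attribute extraction in transfer learning settings.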

## Abstract

Extracting structured information from unstructured data is one of the key challenges in modern information retrieval applications, including e-commerce. Here, we demonstrate how recent advances in machine learning, combined with a recently published multilingual data set with standardized fine-grained product category information, enable robust product attribute extraction in challenging transfer learning settings. Our models can reliably predict product attributes across online shops, languages, or both. Furthermore, we show that our models can be used to match product taxonomies between online retailers.
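
One way to picture the taxonomy matching step is as a vote: predict a standardized category for every product in a shop-specific category, then map that shop category to the label predicted most often. The sketch below illustrates this idea only. This summary does not describe the paper's actual models or matching procedure, so `toy_predict`, its keyword table, the category names, and the sample products are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict


def toy_predict(title: str) -> str:
    """Hypothetical stand-in for a trained attribute model (keyword lookup)."""
    keywords = {
        "milk": "dairy", "milch": "dairy",
        "cheese": "dairy", "käse": "dairy",
        "apple": "fruit", "apfel": "fruit",
    }
    for word in title.lower().split():
        if word in keywords:
            return keywords[word]
    return "unknown"


def match_taxonomy(products, predict):
    """Map each shop category to the standardized label predicted most often.

    `products` is an iterable of (shop_category, product_title) pairs.
    """
    votes = defaultdict(Counter)
    for shop_category, title in products:
        votes[shop_category][predict(title)] += 1
    # Majority vote per shop category.
    return {cat: counts.most_common(1)[0][0] for cat, counts in votes.items()}


# Invented sample data: a German shop taxonomy mapped to English standard labels.
products = [
    ("Molkerei", "Frische Milch 1L"),
    ("Molkerei", "Gouda Käse"),
    ("Obst", "Apfel rot"),
    ("Obst", "Bio Apfel"),
]
mapping = match_taxonomy(products, toy_predict)
print(mapping)
```

In a realistic setting, `predict` would be a classifier trained on one shop or language and applied to another, and the vote counts could double as a rough confidence signal for each mapping.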

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.12139/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/2302.12139/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/2302.12139/full.md

---
Source: https://tomesphere.com/paper/2302.12139