Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik

TL;DR
This paper demonstrates that multilingual Transformer models like mBERT and XLM-RoBERTa are effective for product matching tasks in English and Polish, introduces a new Polish dataset, and provides benchmark results for future research.
Contribution
It introduces the first open Polish dataset for product matching and evaluates multilingual Transformer models on both English and Polish data.
Findings
Multilingual Transformers perform comparably to state-of-the-art solutions.
Fine-tuned models achieve strong results on English product matching datasets.
Baseline results are established for Polish product matching datasets.
Abstract
Product matching is the task of identifying identical products across different data sources. It typically relies on the available product features, which are multimodal (i.e., composed of various data types) and may also be non-homogeneous and incomplete. The paper shows that pre-trained multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on the Web Data Commons training dataset and gold standard for large-scale product matching. The results show that these models perform similarly to the latest solutions tested on this dataset, and in some cases perform better. Additionally, we prepared a new dataset entirely in Polish and based on offers in selected categories obtained from…
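To make the textual setup concrete, here is a minimal sketch of how two offers might be turned into a text pair for a binary match / no-match classifier. The summary above does not describe the authors' input format, so everything here is an assumption for illustration: the field names (`title`, `brand`, `description`), the helper names `serialize_offer` and `make_pair`, and the pair-classification framing itself. Missing or empty fields are skipped, reflecting the point that product features can be incomplete.

```python
from typing import Optional

# Hypothetical attribute names; the paper summary does not list the fields used.
FIELDS = ("title", "brand", "description")


def serialize_offer(offer: dict, fields: tuple = FIELDS) -> str:
    """Join the available textual attributes of one offer into a single string.

    Missing (None) or empty attributes are skipped, and runs of whitespace
    are collapsed to single spaces.
    """
    parts = []
    for field in fields:
        value = offer.get(field)
        if value is None:
            continue
        text = " ".join(str(value).split())  # collapse whitespace
        if text:
            parts.append(text)
    return " ".join(parts)


def make_pair(offer_a: dict, offer_b: dict, label: Optional[int] = None) -> dict:
    """Build one example: two serialized offers plus an optional 0/1 match label."""
    return {
        "text_a": serialize_offer(offer_a),
        "text_b": serialize_offer(offer_b),
        "label": label,
    }


# Illustrative, made-up offers (not taken from any dataset).
example = make_pair(
    {"title": "Laptop  Dell XPS 13", "brand": "Dell", "description": None},
    {"title": "Dell XPS 13 laptop 13,3 cala", "brand": ""},
    label=1,
)
```

Under this assumption, `text_a` and `text_b` would then be passed as a sentence pair to the tokenizer of a multilingual model such as mBERT or XLM-RoBERTa, and the model would be fine-tuned to predict `label`.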
Taxonomy
Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Web Data Mining and Analysis
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Multi-Head Attention · Absolute Position Encodings · Dropout
