ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT
Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski

TL;DR
This paper introduces ConECT, a new Czech-Polish e-commerce dataset with images and metadata, demonstrating that incorporating visual and contextual information improves translation quality in domain-specific machine translation tasks.
Contribution
The paper presents ConECT, a novel dataset for context-aware e-commerce translation, and evaluates methods showing that visual and contextual data enhance translation performance.
Findings
Visual context improves translation quality.
Contextual information like product categories enhances MT.
The dataset is publicly available for further research.
Abstract
Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path…
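One way to picture the idea of feeding a product's category path to a text-to-text model is to prepend it to the source sentence before translation. The following is only a minimal sketch under assumptions: the function name `build_source`, the `" > "` joiner, and the `<sep>` separator are hypothetical choices for illustration and are not taken from the paper, which may format its inputs differently. The Czech strings in the usage note are made-up examples, not dataset entries.

```python
from typing import Optional, Sequence


def build_source(
    text: str,
    category_path: Optional[Sequence[str]] = None,
    sep: str = " <sep> ",  # hypothetical separator, not from the paper
) -> str:
    """Prepend a product category path to the source sentence.

    Returns the sentence unchanged when no usable category path is given,
    so the same function serves context-free and context-aware inputs.
    """
    if not category_path:
        return text
    # Drop empty path segments and join the rest into one breadcrumb string.
    parts = [p.strip() for p in category_path if p and p.strip()]
    if not parts:
        return text
    return " > ".join(parts) + sep + text
```

For example, `build_source("Dámské boty", ["Móda", "Obuv"])` yields `"Móda > Obuv <sep> Dámské boty"`, while `build_source("Dámské boty")` returns the sentence unchanged. The resulting string would then be passed to whatever translation model is being evaluated.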
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
