ManufactuBERT: Efficient Continual Pretraining for Manufacturing
Robin Armingaud, Romaric Besançon

TL;DR
ManufactuBERT is a RoBERTa model continually pretrained on a curated manufacturing corpus. It achieves state-of-the-art results on manufacturing NLP tasks, and training on the deduplicated corpus is faster, showing that careful corpus curation makes domain adaptation both effective and efficient.
Contribution
The paper introduces ManufactuBERT, a pretrained model specialized for manufacturing, together with a data processing pipeline (domain-specific filtering followed by multi-stage deduplication) that improves both training efficiency and downstream performance.
Findings
ManufactuBERT outperforms existing baselines on manufacturing NLP tasks.
Deduplicated training data reduces training time by 33%.
The pipeline is reproducible for other specialized domains.
Abstract
While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates…
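The filter-then-deduplicate pipeline described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' method: the keyword list `DOMAIN_TERMS`, the `min_hits` rule, the two deduplication stages (exact hash match, then Jaccard similarity over word trigrams), and the 0.6 threshold are all hypothetical.

```python
import hashlib
import re

# Hypothetical keyword list; the paper's actual domain filter is not given in this summary.
DOMAIN_TERMS = {"machining", "welding", "cnc", "tolerance", "assembly", "casting"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_in_domain(text, min_hits=2):
    """Keep a document if it mentions at least `min_hits` distinct domain terms."""
    return len(DOMAIN_TERMS & set(tokenize(text))) >= min_hits


def exact_key(text):
    """Hash of the normalized token stream, used for exact-duplicate removal."""
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of word n-grams used to compare documents for near-duplication."""
    toks = tokenize(text)
    if len(toks) < n:
        return {tuple(toks)}
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_corpus(docs, near_dup_threshold=0.6):
    """Stage 1: domain filter. Stage 2: exact dedup. Stage 3: near-duplicate dedup."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        if not is_in_domain(doc):
            continue  # out-of-domain text is dropped first
        key = exact_key(doc)
        if key in seen_hashes:
            continue  # identical after normalization
        seen_hashes.add(key)
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue  # too similar to a document already kept
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

The near-duplicate stage in this sketch compares each candidate against every kept document, which is quadratic in corpus size. At web scale, approximate methods such as MinHash with locality-sensitive hashing are commonly used instead; whether the paper does so is not stated in this summary.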
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Graph Neural Networks · Topic Modeling
