TL;DR
HUGO-CS is a large, curated dataset of cold spray experiments extracted from the scientific literature. It is built with a hybrid LLM-based extraction framework that routes uncertain extractions to manual review, and it is intended to support better process modeling.
Contribution
This work introduces HUGO-CS, the largest cold spray dataset to date, and a novel hybrid extraction framework combining automation and manual validation for scientific literature data.
Findings
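HUGO-CS contains 4,383 experiments with 144 features from 1,124 sources, roughly 30 times more samples than the previous largest dataset (137 samples).
The hybrid extraction framework achieves high accuracy with less manual effort than fully manual extraction, which averages 91 minutes per document.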
The dataset and code are publicly available for benchmarking and research.
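The uncertainty-aware review step named in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation: the `ExtractedField` class, the `confidence` score, the 0.8 threshold, and the sample field names and values are all hypothetical placeholders, chosen only to show how low-confidence extractions might be routed to a human reviewer while the rest are accepted automatically.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One value pulled from a paper by an automated extractor (hypothetical)."""
    name: str
    value: str
    confidence: float  # assumed score in [0, 1]; the real framework may differ


def split_for_review(fields, threshold=0.8):
    """Return (auto_accepted, needs_review) based on a confidence threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Illustrative placeholder records, not taken from HUGO-CS.
fields = [
    ExtractedField("gas_temperature", "500 C", 0.95),
    ExtractedField("particle_velocity", "620 m/s", 0.55),
    ExtractedField("substrate", "Al 6061", 0.80),
]

accepted, needs_review = split_for_review(fields)
print("accepted:", [f.name for f in accepted])
print("needs review:", [f.name for f in needs_review])
```

Under this assumed design, manual effort scales with the number of fields below the threshold rather than with the full document, which is one plausible way a hybrid pipeline could reduce review time.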
Abstract
Cold spraying is an increasingly common approach for repairing and manufacturing components due to its solid-state manufacturing capabilities. However, process optimization remains difficult due to many interdependent parameters and the lack of large-scale, machine-readable data to support modeling. While the scientific literature contains many relevant experiments, results are inconsistently reported (often in tables and figures) and use non-uniform units, limiting utilization at scale. To address these limitations, this work presents HUGO-CS, a literature-derived dataset of 4,383 cold-spray experiments with 144 features from 1,124 sources, exceeding the previous largest dataset (137 samples) by 30x. With completely manual extraction requiring an average of 91 minutes per document, this work designs and leverages a Hybrid-labeled, Uncertainty-aware, General-purpose, Observational…
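The non-uniform units mentioned in the abstract are one reason literature data is hard to use at scale. As a rough sketch of what normalization could look like, the following converts reported pressures and temperatures to common units. The function names, the choice of MPa and kelvin as targets, and the set of supported units are assumptions for illustration and are not taken from the HUGO-CS code.

```python
# Conversion factors from common reported units to megapascals.
PRESSURE_TO_MPA = {
    "MPa": 1.0,
    "kPa": 0.001,
    "bar": 0.1,
    "psi": 0.006894757,
}


def pressure_to_mpa(value, unit):
    """Convert a pressure reading to MPa; raise on unknown units."""
    if unit not in PRESSURE_TO_MPA:
        raise ValueError(f"unsupported pressure unit: {unit}")
    return value * PRESSURE_TO_MPA[unit]


def temperature_to_kelvin(value, unit):
    """Convert a temperature reading in K, C, or F to kelvin."""
    if unit == "K":
        return value
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    raise ValueError(f"unsupported temperature unit: {unit}")


print(pressure_to_mpa(30, "bar"))
print(temperature_to_kelvin(500, "C"))
```

Raising on unknown units, rather than guessing, keeps ambiguous entries visible so they can be resolved by hand.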
