CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design
Hasan Sinan Bank, Daniel R. Herber

TL;DR
CatalogBank is a new dataset that enhances engineering document processing by enabling better information extraction and interoperability through semi-automated annotation and multi-modal data integration.
Contribution
The paper introduces CatalogBank, a structured dataset with a semi-automatic annotation tool, facilitating improved NLP and document engineering tasks for engineering catalogs.
Findings
CatalogBank supports diverse document tasks like layout analysis and knowledge extraction.
The dataset enables automation in design workflows, reducing manual effort.
Baseline metrics demonstrate effective information extraction from PDF catalogs.
Abstract
In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces CatalogBank, a dataset developed to bridge the gap between textual descriptions and other data modalities related to engineering design catalogs. We applied existing information extraction methodologies to extract product information from PDF-based catalogs and used the results in downstream tasks to generate a baseline metric. Our approach not only supports the potential automation of design workflows but also overcomes the limitations of manual data entry and non-standard metadata structures that have historically impeded the seamless integration of textual and other data modalities. Through the use of DocumentLabeler, an open-source annotation…
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Semantic Web and Ontologies
