Better STEP, a format and dataset for boundary representation
Nafiseh Izadyar, Sai Chandra Madduri, Teseo Schneider

TL;DR
This paper introduces Better STEP, an open, efficient format and dataset for boundary representation data derived from CAD, enabling easier integration and processing in large-scale learning pipelines.
Contribution
It proposes a new HDF5-based format and dataset for STEP files, along with an open-source library, improving accessibility and usability for machine learning applications.
Findings
Successfully converted existing CAD datasets to the new format
Demonstrated effectiveness through four standard use cases
Ensured data integrity and fidelity to original STEP files
Abstract
Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it. This dramatically limits their scope and usage in large learning pipelines, as it constrains the possibility of deploying them on computing clusters due to the high cost of per-node licenses. This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them. Our Python package also provides standard functionalities such as sampling, normals, and curvature to ease integration in existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We…
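The sampling, normal, and curvature functionality mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's Python package or its API. It assumes a B-rep face is available as a parametric surface S(u, v), and every name in it (`cylinder`, `evaluate`, `sample_surface`) is hypothetical. Derivatives are approximated with central differences, and the mean curvature comes from the first and second fundamental forms.

```python
import math

def cylinder(u, v, radius=2.0):
    """Hypothetical parametric face: u is the angle, v the height."""
    return (radius * math.cos(u), radius * math.sin(u), v)

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def evaluate(surface, u, v, h=1e-3):
    """Return (point, unit normal, mean curvature) at (u, v) using central differences."""
    p = surface(u, v)
    pu, mu = surface(u + h, v), surface(u - h, v)
    pv, mv = surface(u, v + h), surface(u, v - h)
    # First partial derivatives.
    s_u = scale(sub(pu, mu), 1 / (2 * h))
    s_v = scale(sub(pv, mv), 1 / (2 * h))
    # Second partial derivatives: (f(x+h) - 2 f(x) + f(x-h)) / h^2.
    s_uu = scale(sub(sub(pu, p), sub(p, mu)), 1 / h**2)
    s_vv = scale(sub(sub(pv, p), sub(p, mv)), 1 / h**2)
    s_uv = scale(sub(sub(surface(u + h, v + h), surface(u + h, v - h)),
                     sub(surface(u - h, v + h), surface(u - h, v - h))),
                 1 / (4 * h**2))
    # Unit normal from the cross product of the tangents.
    n = cross(s_u, s_v)
    n = scale(n, 1 / math.sqrt(dot(n, n)))
    # First (E, F, G) and second (L, M, N) fundamental forms.
    E, F, G = dot(s_u, s_u), dot(s_u, s_v), dot(s_v, s_v)
    L, M, N = dot(s_uu, n), dot(s_uv, n), dot(s_vv, n)
    mean_curvature = (E * N - 2 * F * M + G * L) / (2 * (E * G - F * F))
    return p, n, mean_curvature

def sample_surface(surface, u_range, v_range, nu, nv):
    """Sample the surface at the cell centres of a uniform nu x nv parameter grid."""
    samples = []
    for i in range(nu):
        u = u_range[0] + (u_range[1] - u_range[0]) * (i + 0.5) / nu
        for j in range(nv):
            v = v_range[0] + (v_range[1] - v_range[0]) * (j + 0.5) / nv
            samples.append(evaluate(surface, u, v))
    return samples

samples = sample_surface(cylinder, (0.0, 2 * math.pi), (0.0, 1.0), nu=8, nv=3)
```

For a cylinder of radius r, the mean curvature has magnitude 1/(2r); its sign depends on the orientation of the normal. Note that uniform sampling in parameter space is not uniform in surface area in general, so an area-weighted scheme would be needed for unbiased point clouds.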
Peer Reviews
Decision: Submitted to ICLR 2026
HDF5 is a well-supported and common format. The authors demonstrate some built-in functions using their Python library. Converting different datasets into this format might be beneficial to open source? However, I have some difficulty understanding what open source means in this context. My understanding is that this Python script is still built upon OpenCASCADE.
The paper lacks a contribution in terms of dataset and benchmark. No new dataset is introduced. The authors merely converted the ABC and Fusion 360 datasets into their "Better STEP" format. Also, there is no new data structure or more ML-friendly data representation for training. The B-rep is still represented by parametric faces, shells, edges, and vertices, but with their parameters stored in HDF5 format. The topology is also still a linked list (now top-down). To me, this doesn't really make the data any more "ML-friendly".
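The top-down topology this review describes can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's actual schema: the key names (`edge_vertices`, `face_edges`, `shell_faces`) and both helper functions are hypothetical, and the example solid is a tetrahedron. The sketch shows how index-linked shells, faces, edges, and vertices can be traversed to derive a face adjacency graph, a structure commonly fed to graph-based learning models.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical top-down topology of a tetrahedron, stored as index lists:
# shell -> faces -> edges -> vertices.
topology = {
    "edge_vertices": [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)],
    "face_edges": [[0, 3, 1], [0, 4, 2], [1, 5, 2], [3, 5, 4]],
    "shell_faces": [[0, 1, 2, 3]],
}

def face_adjacency(face_edges):
    """Return sorted (face, face) pairs that share at least one edge."""
    # Invert the face -> edge links into an edge -> faces lookup.
    edge_to_faces = defaultdict(list)
    for face, edges in enumerate(face_edges):
        for edge in edges:
            edge_to_faces[edge].append(face)
    # Every pair of distinct faces incident to the same edge is adjacent.
    pairs = set()
    for faces in edge_to_faces.values():
        for a, b in combinations(sorted(set(faces)), 2):
            pairs.add((a, b))
    return sorted(pairs)

def face_vertices(face, topo):
    """Walk face -> edges -> vertices and return the face's vertex indices."""
    verts = set()
    for edge in topo["face_edges"][face]:
        verts.update(topo["edge_vertices"][edge])
    return sorted(verts)

adjacency = face_adjacency(topology["face_edges"])
```

In a tetrahedron every pair of faces shares exactly one edge, so `adjacency` contains all six face pairs. Flat index arrays like these map directly onto array datasets in a container format such as HDF5, though the layout used by the paper may differ.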
- A standard format for representing B-Reps from different sources in a consistent manner that can be easily utilized in Python, and a dataset in this format, is beneficial for the research and development of new approaches, simultaneously allowing for better and more consistent benchmarks and evaluation of these approaches.
- The provided format is independent of the original (commonly proprietary) CAD file format, which allows combining different datasets. In addition, the proposed 'abs' libra
- Code listings are not clear. Listings 4 and 5 are used to replace the compute_labels function from Listing 2; however, inconsistent return formats (1/0 vs 1/None) are used between different examples. Additionally, Listing 3 does not provide any meaningful/helpful information. Pseudocode detailing how read_meshes/get_mesh worked would be more useful than the current provided code. In general, I think the provided code could be clearer, and more details could be provided in addition to the very
Bypassing proprietary kernels and version incompatibilities, it directly provides B-rep equivalent representations that can be consumed by ML frameworks; Provides a clear hierarchical structure (geometry/topology/mesh), standardized APIs (sampling, normals, curvature, topology traversal, etc.), and reports statistics such as conversion and failure rates; The same interface can generate data for multiple types of downstream tasks, with reasonable example coverage, showing plug-and-play suppor
The main problem is the insufficient demonstration of reproducibility and usability: there is currently no external demo, sample data or minimum runnable script, and reproduction must wait for formal acceptance and release, which has a high threshold; there is a lack of online browsing/interactive examples to demonstrate "ease of use" (such as visualization of typical B-rep, one-step sampling/export of point cloud); insufficient display of generated scenes - although the paper discusses the pote
Taxonomy
Topics: 3D Shape Modeling and Analysis · Manufacturing Process and Optimization · Advanced Numerical Analysis Techniques
