Benchmarking pre-trained text embedding models in aligning built asset information
Mehrzad Shahinmoghadam, Ali Motamedi

TL;DR
This paper benchmarks various pre-trained text embedding models to evaluate their effectiveness in aligning complex built asset technical data with domain-specific classifications, aiming to automate asset information mapping.
Contribution
It introduces a comprehensive benchmark and open-source library for evaluating text embeddings in built asset data alignment, addressing a gap in domain-specific semantic representation assessment.
Findings
Benchmarking across six datasets shows varied model performance.
The results highlight the need for domain adaptation techniques.
An open-source library supports future research in this area.
Abstract
Accurate mapping of built asset information to established data classification systems and taxonomies is crucial for effective asset management, whether for compliance at project handover or in ad-hoc data integration scenarios. Because built asset data is complex and predominantly comprises technical text elements, this process remains largely manual and reliant on domain expert input. Recent breakthroughs in contextual text representation learning (text embedding), particularly through pre-trained large language models, offer promising approaches that can help automate the cross-mapping of built asset data. However, no comprehensive evaluation has yet been conducted to assess these models' ability to effectively represent the complex semantics specific to built asset technical terminology. This study presents a comparative benchmark of state-of-the-art…
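As an illustration of the embedding-based cross-mapping idea described in the abstract, the following is a minimal sketch and not the authors' benchmark or library. It assumes a toy `embed` function (character trigram counts) standing in for a pre-trained embedding model, and the asset label and class names are hypothetical. An asset description is assigned to the classification entry whose vector has the highest cosine similarity.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a pre-trained embedding model (assumption):
    represents text as a sparse vector of character trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(asset_label: str, classes: list[str]) -> tuple[str, float]:
    """Return the class whose embedding is closest to the asset label."""
    query = embed(asset_label)
    scored = [(name, cosine(query, embed(name))) for name in classes]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical classification entries and asset label, for illustration only.
classes = ["Air handling units", "Centrifugal pumps", "Fire dampers"]
best, score = align("AHU-01 air handling unit, rooftop", classes)
print(best, round(score, 3))
```

In practice, `embed` would be replaced by a call to whichever pre-trained model is under evaluation; the nearest-neighbour step would stay the same.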
Taxonomy
Topics: Advanced Text Analysis Techniques
