Benchmarking pre-trained text embedding models in aligning built asset information

Mehrzad Shahinmoghadam; Ali Motamedi

arXiv:2411.12056·cs.CL·November 20, 2024

TL;DR

This paper benchmarks pre-trained text embedding models on how well they align complex built asset technical data with domain-specific classification systems, with the goal of automating asset information mapping.

Contribution

The paper introduces a comprehensive benchmark and an open-source library for evaluating text embeddings on built asset data alignment, addressing the lack of domain-specific assessments of semantic representation.

Findings

1. Benchmarking across six datasets shows varied model performance.
2. Results highlight the need for domain adaptation techniques.
3. Open-source library supports future research in this area.

Abstract

Accurate mapping of built asset information to established data classification systems and taxonomies is crucial for effective asset management, whether for compliance at project handover or in ad-hoc data integration scenarios. Due to the complex nature of built asset data, which predominantly comprises technical text elements, this process remains largely manual and reliant on domain expert input. Recent breakthroughs in contextual text representation learning (text embedding), particularly through pre-trained large language models, offer promising approaches that can facilitate automated cross-mapping of built asset data. However, no comprehensive evaluation has yet been conducted to assess these models' ability to effectively represent the complex semantics specific to built asset technical terminology. This study presents a comparative benchmark of state-of-the-art…
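
The alignment task described in the abstract can be pictured as nearest-neighbour matching in an embedding space: embed an asset description, embed each candidate class label, and pick the label with the highest cosine similarity. The sketch below is only an illustration of that general idea, not the paper's method or its library. The `embed` function is a toy bag-of-words stand-in for a pre-trained embedding model, and the class labels and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical class labels, used only for illustration; they are not
# taken from the paper or from any real classification system.
CLASSES = ["air handling unit", "centrifugal pump", "fire alarm panel"]


def embed(text):
    """Toy stand-in for a pre-trained embedding model (assumption).

    Returns a sparse bag-of-words vector as token -> count.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(description, classes):
    """Return the class label most similar to the asset description."""
    query = embed(description)
    return max(classes, key=lambda label: cosine(query, embed(label)))


if __name__ == "__main__":
    print(align("supply air handling unit level 2", CLASSES))
```

In a real setting, `embed` would be replaced by a call to an actual embedding model, which can match descriptions and labels that share meaning but few exact words. The bag-of-words stand-in cannot do that, which is the kind of gap a benchmark of embedding models is meant to measure.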

Code & Models

Repositories

mehrzadshm/built-bench-paper (PyTorch, official)

Taxonomy

Topics: Advanced Text Analysis Techniques