ArcBERT: An LLM-based Search Engine for Exploring Integrated Multi-Omics Metadata

Gajendra Doniparthi; Shashank Balu Pandhare; Stefan Deßloch; Timo Mühlhaus

arXiv:2512.15365 · cs.DB · December 18, 2025

Open Access

TL;DR

ArcBERT is a novel LLM-based search engine that enables natural language querying and semantic understanding for exploring integrated multi-omics metadata in research data management systems.

Contribution

The paper introduces ArcBERT, a system that leverages domain-specific LLMs for natural language search and structural understanding of complex metadata hierarchies.

Findings

01 Enables natural language queries for metadata exploration

02 Uses semantic matching for improved search accuracy

03 Handles diverse user query patterns effectively

Abstract

Traditional search applications within Research Data Management (RDM) ecosystems are crucial in helping users discover and explore the structured metadata from the research datasets. Typically, text search engines require users to submit keyword-based queries rather than using natural language. However, using Large Language Models (LLMs) trained on domain-specific content for specialized natural language processing (NLP) tasks is becoming increasingly common. We present ArcBERT, an LLM-based system designed for integrated metadata exploration. ArcBERT understands natural language queries and relies on semantic matching, unlike traditional search applications. Notably, ArcBERT also understands the structure and hierarchies within the metadata, enabling it to handle diverse user querying patterns effectively.
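
The contrast the abstract draws between keyword queries and semantic matching over hierarchical metadata can be illustrated with a toy sketch. This is not ArcBERT's implementation: the `CONCEPTS` table is a hypothetical, hand-written stand-in for what a domain-specific language model would learn, and the record fields (`study`, `organism`, `assay`, `factor`) are invented for illustration only.

```python
import math
from collections import Counter

# ASSUMPTION: a hand-written synonym table standing in for a learned,
# domain-specific embedding model. Words mapped to the same concept are
# treated as semantically equivalent.
CONCEPTS = {
    "rna-seq": "transcriptomics",
    "transcriptome": "transcriptomics",
    "mass-spec": "proteomics",
    "proteome": "proteomics",
    "arabidopsis": "plant",
    "plant": "plant",
    "heat": "heat_stress",
    "temperature": "heat_stress",
}

# ASSUMPTION: illustrative nested metadata records, not a real schema.
RECORDS = {
    "dataset_a": {"study": {"organism": "Arabidopsis",
                            "assay": {"type": "RNA-seq", "factor": "heat"}}},
    "dataset_b": {"study": {"organism": "yeast",
                            "assay": {"type": "mass-spec", "factor": "salt"}}},
}


def flatten(node, path=()):
    """Yield (path, value) pairs for every leaf of a nested dict."""
    for key, value in node.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), str(value)


def record_text(record):
    """Serialize a record so each value keeps its hierarchy path as context."""
    return " ".join(f"{' '.join(path)} {value}" for path, value in flatten(record))


def embed(text):
    """Toy embedding: counts of known concepts; unknown words are ignored."""
    return Counter(CONCEPTS[t] for t in text.lower().split() if t in CONCEPTS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def keyword_match(query, text):
    """Baseline: true only if the query and text share a literal word."""
    return bool(set(query.lower().split()) & set(text.lower().split()))


def search(query, records):
    """Rank record names by similarity to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(record_text(r))), name) for name, r in records.items()]
    return sorted(scored, reverse=True)


query = "plant transcriptome under temperature stress"
ranking = search(query, RECORDS)
```

With this query, the keyword baseline finds no literal word shared with `dataset_a`, while the concept-based ranking places `dataset_a` first. A real system would replace `embed` with vectors from a trained model; the flattening step only hints at one way hierarchy information could be carried into the text being matched.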

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Research Data Management Practices · Semantic Web and Ontologies