LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing

Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, and Yanxun Xu

arXiv:2508.05672 · cs.IR · September 15, 2025

TL;DR

LMAR is a model-agnostic framework that combines LLM-guided data synthesis with contrastive embedding adaptation to improve domain-specific knowledge retrieval while keeping latency low and hardware requirements moderate.

Contribution

The paper presents a model-agnostic pipeline that improves retrieval performance in domain-specific contexts using LLM-synthesized training data and text clustering, with moderate hardware requirements.

Findings

01 Outperforms baseline models on multiple domain datasets

02 Maintains low latency and moderate hardware requirements

03 Integrates with existing RAG architectures

Abstract

Retrieval-Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to the performance deterioration of pre-trained embeddings and the prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning embedding models with data augmentation offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results…
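
The abstract mentions triplet sampling and contrastive embedding adaptation without giving the objective. As a rough illustration only, the sketch below shows a generic margin-based triplet loss over cosine distance. This is an assumed, textbook formulation, not necessarily the loss LMAR uses; the function names, the margin value, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic hinge-style triplet loss (an assumption, not the paper's stated loss).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings (hypothetical): a query, a relevant chunk, an irrelevant chunk.
query = [1.0, 0.0]
relevant_chunk = [1.0, 0.0]
irrelevant_chunk = [0.0, 1.0]

# Correctly ordered triplet: the relevant chunk is closer, so no penalty.
well_separated = triplet_loss(query, relevant_chunk, irrelevant_chunk)

# Swapped roles: the "positive" is farther than the "negative", so the loss is positive.
swapped = triplet_loss(query, irrelevant_chunk, relevant_chunk)
```

In a fine-tuning loop, a loss of this kind would be averaged over many (query, positive, negative) triplets and minimized with respect to the embedding model's parameters; here the embeddings are fixed vectors purely to show how the penalty behaves.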

Taxonomy

Topics: Semantic Web and Ontologies · Topic Modeling · Natural Language Processing Techniques