LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing
Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, and Yanxun Xu

TL;DR
LMAR introduces a versatile framework combining LLM-guided data synthesis and contrastive embedding adaptation to improve domain-specific knowledge retrieval efficiently and effectively.
Contribution
It presents a novel, model-agnostic pipeline that enhances retrieval performance in domain-specific contexts using synthetic data and clustering, with minimal hardware needs.
Findings
Outperforms baseline models on multiple domain datasets
Maintains low latency and moderate hardware requirements
Seamlessly integrates with existing RAG architectures
Abstract
Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning data augmentation embedding models offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) Triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSemantic Web and Ontologies · Topic Modeling · Natural Language Processing Techniques
