Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models

Aryan Jadon; Avinash Patil; Shashank Kumar

arXiv:2502.15854·cs.LG·February 25, 2025

TL;DR

This paper introduces a new framework combining token-aware evaluation metrics and synthetic data generation with reasoning models to improve retrieval-augmented generation in technical domains, addressing current evaluation limitations.

Contribution

It develops token-aware metrics and a reasoning model-driven pipeline for generating domain-specific QA pairs, enhancing RAG system evaluation and performance in complex technical texts.

Findings

01. Smaller chunks (<10 tokens) improve precision by 31-42%.

02. Domain-specific embeddings cause 22% variance in optimal chunk size.

03. DeepSeek-R1-Distill-Qwen-32B outperforms alternatives in concept alignment.

Abstract

Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains requiring precise information extraction from complex documents. Current evaluation methodologies relying on document-level metrics inadequately capture token-resolution retrieval accuracy that is critical for domain-related documents. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. First, we introduce token-aware metrics Precision $\Omega$ and Intersection-over-Union (IoU) that quantify context preservation versus information density trade-offs inherent in technical texts. Second, we develop a reasoning model-driven pipeline using instruction-tuned LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate context-anchored QA pairs with discontinuous reference spans across…
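To make the idea of token-resolution scoring concrete, here is a minimal sketch of token-level overlap metrics. It is not the paper's implementation: the abstract does not define Precision $\Omega$, so the code below computes only ordinary token-level precision, recall, and IoU. The half-open `(start, end)` span representation and the names `span_tokens` and `token_overlap_metrics` are assumptions introduced for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a span is a half-open (start, end) range of token offsets.
Span = Tuple[int, int]


def span_tokens(spans: Iterable[Span]) -> Set[int]:
    """Expand possibly discontinuous spans into a set of token indices."""
    tokens: Set[int] = set()
    for start, end in spans:
        tokens.update(range(start, end))
    return tokens


def token_overlap_metrics(
    retrieved: Iterable[Span], reference: Iterable[Span]
) -> Dict[str, float]:
    """Token-level precision, recall, and IoU between retrieved and reference spans."""
    r = span_tokens(retrieved)
    g = span_tokens(reference)
    inter = len(r & g)
    union = len(r | g)
    return {
        "precision": inter / len(r) if r else 0.0,
        "recall": inter / len(g) if g else 0.0,
        "iou": inter / union if union else 0.0,
    }


# Hypothetical example: one retrieved chunk scored against a reference
# answer made of two discontinuous spans.
metrics = token_overlap_metrics(
    retrieved=[(0, 25)],
    reference=[(10, 20), (40, 45)],
)
print(metrics)
```

In this hypothetical case the retrieved chunk covers only one of the two reference spans, so recall falls below 1 even though the chunk is large. A document-level hit/miss metric would not register that loss, which is the kind of gap the abstract attributes to existing evaluation methods.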

Code & Models

Repositories

aryan-jadon/synthetic-data-generation-and-evaluation-using-reasoning-models
Official

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection