Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
Junyi Hu, You Zhou, Jie Wang

TL;DR
This paper introduces the Overall Performance Index (OPI), an intrinsic metric for evaluating retrieval-augmented generation systems on deep-logic questions, and demonstrates its effectiveness through experiments with LangChain and various retrievers.
Contribution
The paper proposes the OPI metric for intrinsic evaluation of RAG systems and analyzes the performance of different retrievers using this new metric.
Findings
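BERT embedding similarity scores correlate strongly with extrinsic evaluation scores.
Among commonly used retrievers, the cosine similarity retriever using BERT-based embeddings performs best, and the Euclidean distance-based retriever performs worst.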
Combining multiple retrievers improves overall performance.
Abstract
We introduce the Overall Performance Index (OPI), an intrinsic metric to evaluate retrieval-augmented generation (RAG) mechanisms for applications involving deep-logic queries. OPI is computed as the harmonic mean of two key metrics: the Logical-Relation Correctness Ratio and the average of BERT embedding similarity scores between ground-truth and generated answers. We apply OPI to assess the performance of LangChain, a popular RAG tool, using a logical relations classifier fine-tuned from GPT-4o on the RAG-Dataset-12000 from Hugging Face. Our findings show a strong correlation between BERT embedding similarity scores and extrinsic evaluation scores. Among the commonly used retrievers, the cosine similarity retriever using BERT-based embeddings outperforms others, while the Euclidean distance-based retriever exhibits the weakest performance. Furthermore, we demonstrate that combining…
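The OPI definition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the Logical-Relation Correctness Ratio is the fraction of generated answers whose logical relation is judged correct, and that the BERT embedding similarity scores have already been computed and lie in [0, 1]. The function name `overall_performance_index` and its parameters are hypothetical.

```python
from statistics import mean


def overall_performance_index(correct_flags, similarity_scores):
    """Harmonic mean of a correctness ratio and an average similarity score.

    correct_flags: one boolean per question, True when the logical relation
        of the generated answer was judged correct (assumed definition).
    similarity_scores: one precomputed similarity score per question,
        assumed to lie in [0, 1].
    """
    if not correct_flags or not similarity_scores:
        raise ValueError("both inputs must be non-empty")

    # Fraction of answers with a correct logical relation.
    correctness_ratio = sum(1 for flag in correct_flags if flag) / len(correct_flags)
    # Average similarity between ground-truth and generated answers.
    average_similarity = mean(similarity_scores)

    total = correctness_ratio + average_similarity
    if total == 0:
        # Both components are zero, so the harmonic mean is taken as zero.
        return 0.0
    return 2 * correctness_ratio * average_similarity / total


if __name__ == "__main__":
    flags = [True, True, False, True]
    scores = [0.8, 0.9, 0.7, 0.8]
    print(overall_performance_index(flags, scores))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a system scores well under this sketch only when both the correctness ratio and the average similarity are high.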
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · WordPiece · Attention Dropout · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Dropout · Byte Pair Encoding · BERT
