Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling

Thea Aviss

arXiv:2410.18105 · cs.IR · October 25, 2024

TL;DR

APEX-Embedding-7B introduces novel training techniques using entity relationship maps and model-aware contrastive sampling, significantly improving factual focus and retrieval accuracy in document retrieval tasks.

Contribution

The paper presents a new 7-billion parameter model with innovative training methods that enhance factual accuracy and retrieval performance for longer documents.

Findings

01 Achieved 90.86% rank@1 accuracy in retrieval tasks.

02 Reduced training data context size by 37.71%.

03 Set new state-of-the-art in document retrieval accuracy.
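
For readers unfamiliar with the rank@1 figure quoted above, the following is a minimal sketch of how such a metric is commonly computed: the fraction of queries for which the highest-scoring document is the correct one. The function name `rank_at_1`, the score matrix, and the gold labels are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Sequence


def rank_at_1(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of queries whose top-scoring document is the gold document.

    scores[i][j] is the similarity between query i and document j;
    gold[i] is the index of the correct document for query i.
    """
    if not scores or len(scores) != len(gold):
        raise ValueError("scores and gold must be non-empty and the same length")
    hits = 0
    for row, gold_idx in zip(scores, gold):
        # Index of the document with the highest similarity for this query.
        top_idx = max(range(len(row)), key=row.__getitem__)
        hits += int(top_idx == gold_idx)
    return hits / len(scores)


# Hypothetical toy data: 3 queries scored against 3 documents.
example_scores = [
    [0.9, 0.1, 0.3],  # top document is 0, gold is 0: hit
    [0.2, 0.8, 0.5],  # top document is 1, gold is 1: hit
    [0.4, 0.6, 0.1],  # top document is 1, gold is 0: miss
]
example_gold = [0, 1, 0]
accuracy = rank_at_1(example_scores, example_gold)
```

On this toy input two of the three queries are ranked correctly, so `accuracy` is 2/3.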

Abstract

In this paper we present APEX-Embedding-7B (Advanced Processing for Epistemic eXtraction), a 7-billion-parameter decoder-only text Feature Extraction Model designed specifically for Document Retrieval-Augmented Generation (RAG) tasks. Our approach employs two training techniques that yield an emergent improvement in factual focus: (1) pre-convergence interrupted fine-tuning using Structured Entity Relationship Maps as training data input, designed to shift the model's attention and bias it towards factual content rather than semantic style, which enhances plain-text performance despite the model not being directly trained on plain text; and (2) Model-Aware Contrastive Sampling, which creates a balanced and evenly distributed collation map of hard and soft negatives directly informed by the base model's competency. This combined methodology yields significant improvements, enhancing plain text…
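
The abstract describes Model-Aware Contrastive Sampling only at a high level, so the following is a rough sketch of one way the idea could look, not the author's procedure. It assumes that "informed by the base model's competency" means scoring candidate negatives with the base model's own embeddings, and that "balanced" means drawing equal numbers of hard negatives (scored as similar to the query) and soft negatives (scored as dissimilar). The function names, the `hard_threshold` parameter, and the toy vectors are all hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_balanced_negatives(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    per_bucket: int,
    hard_threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split negatives into hard/soft by base-model similarity, then draw equally."""
    rng = random.Random(seed)
    hard: List[str] = []
    soft: List[str] = []
    for doc_id, vec in candidates.items():
        # Vectors are assumed to come from the base (pre-fine-tuning) model.
        bucket = hard if cosine(query_vec, vec) >= hard_threshold else soft
        bucket.append(doc_id)
    # Draw the same number from each bucket so neither kind dominates.
    k = min(per_bucket, len(hard), len(soft))
    return {"hard": rng.sample(hard, k), "soft": rng.sample(soft, k)}


# Hypothetical 2-D embeddings standing in for base-model outputs.
example_query = [1.0, 0.0]
example_candidates = {
    "a": [0.9, 0.1],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
    "e": [0.1, 0.9],
}
negatives = sample_balanced_negatives(example_query, example_candidates, per_bucket=2)
```

In this toy case documents `a` and `b` fall in the hard bucket and `c`, `d`, `e` in the soft bucket, so the sampler returns two of each. A fixed threshold is the simplest possible choice; a percentile-based split would be another reasonable assumption.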

Code & Models

Models

🤗 5DAI/APEX-Embedding-7B-v0.1

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Data Mining Algorithms and Applications

Methods: Softmax · Attention Is All You Need · Balanced Selection