Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling
Thea Aviss

TL;DR
APEX-Embedding-7B introduces novel training techniques using entity relationship maps and model-aware contrastive sampling, significantly improving factual focus and retrieval accuracy in document retrieval tasks.
Contribution
The paper presents a new 7-billion parameter model with innovative training methods that enhance factual accuracy and retrieval performance for longer documents.
Findings
Achieved 90.86% rank@1 accuracy in retrieval tasks.
Reduced training data context size by 37.71%.
Set a new state of the art in document retrieval accuracy.
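For readers unfamiliar with the metric, rank@1 accuracy is the fraction of queries whose top-ranked retrieved document is the correct one. The snippet below is a minimal sketch of that calculation on toy vectors. It assumes cosine similarity as the ranking score, which this summary does not confirm, and the function names and data are hypothetical, not taken from the paper.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_at_1(query_vecs, doc_vecs, gold_doc_ids):
    """Fraction of queries whose highest-scoring document is the gold document."""
    hits = 0
    for query, gold in zip(query_vecs, gold_doc_ids):
        # Index of the document with the highest similarity to this query.
        best = max(range(len(doc_vecs)), key=lambda i: cosine(query, doc_vecs[i]))
        hits += int(best == gold)
    return hits / len(query_vecs)

# Toy data (hypothetical): three documents, three queries, one gold document each.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gold = [0, 1, 2]

score = rank_at_1(queries, docs, gold)
print(f"rank@1 = {score:.4f}")
```

In this toy run the third query is closest to document 0 rather than its gold document 2, so two of three queries count as hits.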
Abstract
In this paper we present APEX-Embedding-7B (Advanced Processing for Epistemic eXtraction), a 7-billion-parameter decoder-only text feature extraction model designed specifically for document Retrieval-Augmented Generation (RAG) tasks. Our approach employs two training techniques that yield an emergent improvement in factual focus: (1) pre-convergence interrupted fine-tuning using Structured Entity Relationship Maps as training data input, designed to shift the model's attention and bias it towards factual content rather than semantic style, which enhances plain-text performance even though the model is not directly trained on plain text; and (2) Model-Aware Contrastive Sampling, which creates a balanced and evenly distributed collation map of hard and soft negatives directly informed by the base model's competency. This combined methodology yields significant improvements, enhancing plain text…
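The abstract does not spell out the sampling procedure, so the following is only one plausible reading of "model-aware" sampling: score each candidate negative with the base model, split candidates into hard (scored as similar to the query) and soft (scored as dissimilar) buckets, and draw the same number from each so the two stay balanced. The function name, the 0.5 threshold, and the score dictionary are all illustrative assumptions, not the paper's method.

```python
import random

def sample_balanced_negatives(base_scores, positive_id, n_per_bucket,
                              hard_threshold=0.5, seed=0):
    """Pick equal numbers of hard and soft negatives for one query.

    base_scores: dict mapping doc_id -> similarity to the query as scored
                 by the base model (assumed to be precomputed elsewhere).
    hard_threshold: hypothetical cut-off; at or above it a negative is "hard".
    """
    rng = random.Random(seed)
    candidates = {d: s for d, s in base_scores.items() if d != positive_id}
    # Sort ids so sampling is reproducible for a fixed seed.
    hard = sorted(d for d, s in candidates.items() if s >= hard_threshold)
    soft = sorted(d for d, s in candidates.items() if s < hard_threshold)
    # Keep the buckets balanced even if one of them is small.
    k = min(n_per_bucket, len(hard), len(soft))
    return {"hard": rng.sample(hard, k), "soft": rng.sample(soft, k)}

# Hypothetical base-model scores for one query; "pos" is the true match.
scores = {"pos": 0.95, "a": 0.8, "b": 0.7, "e": 0.6, "c": 0.2, "d": 0.1}
picked = sample_balanced_negatives(scores, positive_id="pos", n_per_bucket=2)
print(picked)
```

Because the buckets are defined by the base model's own scores, what counts as a hard negative changes with the model being fine-tuned, which is the sense in which this sketch is model-aware.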
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Data Mining Algorithms and Applications
Methods: Softmax · Attention Is All You Need · Balanced Selection
