Little Giants: Synthesizing High-Quality Embedding Data at Scale

Haonan Chen; Liang Wang; Nan Yang; Yutao Zhu; Ziliang Zhao; Furu Wei; Zhicheng Dou

arXiv:2410.18634·cs.CL·November 5, 2024

TL;DR

This paper presents SPEED, a framework that enables small open-source models to generate high-quality synthetic embedding data efficiently, reducing reliance on expensive proprietary models and outperforming state-of-the-art methods.

Contribution

SPEED is a novel framework that aligns small models to produce high-quality synthetic embedding data, significantly reducing API costs and improving data quality.

Findings

01

SPEED outperforms GPT-4-based methods in embedding quality.

02

It uses less than 10% of GPT API calls.

03

The study reveals key factors affecting synthetic data quality.

Abstract

Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely…

Code & Models

Repositories

haon-chen/SPEED (PyTorch, official)

Videos

Little Giants: Synthesizing High-Quality Embedding Data at Scale · underline

Taxonomy

Topics: Data Mining Algorithms and Applications

Methods: Label Smoothing · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Transformer · Multi-Head Attention · Linear Warmup With Cosine Annealing · Adam · Softmax