LGAI-EMBEDDING-Preview Technical Report

Jooyoung Choi; Hyun Kim; Hansol Jang; Changwook Jun; Kyunghoon Bae; Hyewon Choi; Stanley Jungkyu Choi; Honglak Lee; Chulmin Yun

arXiv:2506.07438·cs.CL·June 24, 2025

TL;DR

This paper introduces a unified instruction-based framework for learning versatile text embeddings using a decoder-only LLM, achieving strong performance across diverse IR and non-IR tasks without task-specific fine-tuning.

Contribution

The report combines in-context learning, soft supervision, and adaptive hard-negative mining to generate high-quality, generalized text embeddings.

Findings

01. Achieves top performance on the MTEB benchmark across 41 tasks.

02. Outperforms larger or fully fine-tuned models in generalization.

03. Demonstrates effective use of soft relevance scores and adaptive negative sampling.

Abstract

This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out…
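The two training signals named in the abstract, soft labels and margin-based hard-negative filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it states the filtering criterion, so the rule used here (drop any candidate whose teacher score is at or above a fixed fraction of the positive's score) is an assumption, as are the function names, the `margin_ratio` and `temperature` parameters, and the choice of KL divergence as the distillation loss.

```python
import numpy as np


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a 1-D list of scores."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def filter_negatives(pos_score, neg_scores, margin_ratio=0.95):
    """Return indices of candidates kept as hard negatives.

    Assumed rule (hypothetical): a candidate is kept only if its teacher
    score is strictly below margin_ratio * pos_score. Candidates scoring
    at or above that threshold are treated as likely false negatives.
    The threshold adapts to each query because it scales with pos_score.
    """
    neg_scores = np.asarray(neg_scores, dtype=float)
    threshold = margin_ratio * pos_score
    return np.flatnonzero(neg_scores < threshold)


def soft_label_loss(student_sims, teacher_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list.

    teacher_scores are continuous relevance scores from a teacher model;
    student_sims are the embedding model's similarities for the same
    candidates. Both are turned into distributions with a softmax.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_sims, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Usage: teacher scores for one positive and four mined candidates.
pos = 0.9
candidates = [0.95, 0.88, 0.5, 0.2]
kept = filter_negatives(pos, candidates)  # indices 2 and 3 survive
loss = soft_label_loss([0.7, 0.4, 0.1], [pos, 0.5, 0.2])
```

In this sketch the first two candidates are discarded because they score too close to (or above) the positive, which is the kind of case a margin-based filter is meant to catch; the remaining candidates then supply the fine-grained targets for the soft-label loss.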

Code & Models

Models

🤗 annamodels/LGAI-Embedding-Preview (model, 85 downloads, 14 likes)