4bit-Quantization in Vector-Embedding for RAG

Taehee Jeong

arXiv:2501.10534 · cs.LG · January 22, 2025

TL;DR

This paper introduces a 4-bit quantization method for vector embeddings in retrieval-augmented generation, significantly reducing memory usage and increasing search speed, thereby enabling more efficient deployment of RAG systems.

Contribution

The paper proposes storing RAG embedding vectors with 4-bit quantization, improving memory efficiency and computational speed over higher-precision storage such as 32-bit floating point.

Findings

1. Reduces memory storage of embeddings by 75%
2. Speeds up search processes in RAG systems
3. Maintains comparable retrieval accuracy with quantized vectors

Abstract

Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point…
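
To make the idea of 4-bit storage concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simple per-vector min-max (affine) uniform quantizer with 16 levels and packs two 4-bit codes into each byte; the function names `quantize_4bit` and `dequantize_4bit` are hypothetical, and the paper may use a different scheme (for example, grouping or a different scaling rule). Each dimension then occupies 4 bits instead of the 32 bits of a float32 value, plus one scale and one offset per vector.

```python
from typing import List, Tuple


def quantize_4bit(vec: List[float]) -> Tuple[bytes, float, float]:
    """Map each float to a 4-bit code (0..15) and pack two codes per byte.

    Assumption: per-vector min-max scaling; this is an illustrative choice.
    """
    lo, hi = min(vec), max(vec)
    # Step between adjacent levels; fall back to 1.0 for a constant vector.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up
    packed = bytes(
        (codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)
    )
    return packed, scale, lo


def dequantize_4bit(packed: bytes, scale: float, lo: float, dim: int) -> List[float]:
    """Unpack two codes per byte and map them back to approximate floats."""
    codes: List[int] = []
    for b in packed:
        codes.extend((b >> 4, b & 0x0F))
    # Drop the padding code, if any, by truncating to the original dimension.
    return [lo + c * scale for c in codes[:dim]]


if __name__ == "__main__":
    vec = [0.12, -0.53, 0.98, 0.07, -0.31]
    packed, scale, lo = quantize_4bit(vec)
    restored = dequantize_4bit(packed, scale, lo, len(vec))
    print(len(packed), "bytes for", len(vec), "dimensions")
    print(restored)
```

With this quantizer, each restored value differs from the original by at most about half a quantization step (`scale / 2`). Whether that error is small enough to preserve retrieval quality depends on the embedding model and the similarity measure, which is what the paper evaluates.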

Code & Models

Repositories

taeheej/4bit-quantization-in-vector-embedding-for-rag (official)

Taxonomy

Topics: Photonic and Optical Devices · Sparse and Compressive Sensing Techniques

Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Adam · Softmax · Linear Warmup With Linear Decay · Residual Connection · Dropout · Byte Pair Encoding