4bit-Quantization in Vector-Embedding for RAG
Taehee Jeong

TL;DR
This paper introduces a 4-bit quantization method for vector embeddings in retrieval-augmented generation, significantly reducing memory usage and increasing search speed, thereby enabling more efficient deployment of RAG systems.
Contribution
It proposes a novel 4-bit quantization technique for embedding vectors in RAG, improving memory efficiency and computational speed over traditional high-precision methods.
Findings
Reduces memory storage of embeddings by 75%
Speeds up search processes in RAG systems
Maintains comparable retrieval accuracy with quantized vectors
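The size of the saving depends on the precision of the baseline, which the findings above do not state. As a rough check, the sketch below computes raw storage at different bit widths. The corpus size and dimension are hypothetical values chosen for illustration, and the calculation ignores any per-vector scale or index overhead. Going from 16-bit to 4-bit values saves 75% of raw storage; going from 32-bit to 4-bit saves 87.5%.

```python
def embedding_bytes(num_vectors: int, dim: int, bits_per_value: int) -> int:
    """Raw bytes needed to store num_vectors x dim values at a given bit width."""
    total_bits = num_vectors * dim * bits_per_value
    return (total_bits + 7) // 8  # round up to whole bytes


def reduction(baseline_bits: int, quantized_bits: int) -> float:
    """Fraction of raw storage saved when moving to the smaller bit width."""
    return 1.0 - quantized_bits / baseline_bits


# Hypothetical corpus: one million vectors of dimension 768 (not from the paper).
num_vectors, dim = 1_000_000, 768

for bits in (32, 16, 4):
    size_mb = embedding_bytes(num_vectors, dim, bits) / 1e6
    print(f"{bits:>2}-bit: {size_mb:,.0f} MB")

print(f"16-bit -> 4-bit saves {reduction(16, 4):.1%}")
print(f"32-bit -> 4-bit saves {reduction(32, 4):.1%}")
```

In practice a quantized store also keeps a scale factor (and possibly an offset) per vector or per group, so the real saving is slightly below these raw ratios.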
Abstract
Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point…
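The abstract is cut off before it describes the quantization scheme, so the following is only a minimal sketch of one common approach, not the paper's method. It assumes symmetric per-vector absmax scaling to integer codes in [-7, 7], with two codes packed into each byte. The function names are hypothetical, and the code uses plain Python for clarity where a real system would use a vectorized array library.

```python
import math


def quantize_4bit(vec):
    """Map floats to signed 4-bit codes in [-7, 7] and pack two codes per byte.

    Assumption: symmetric absmax scaling with one scale per vector.
    Returns (packed_bytes, scale).
    """
    absmax = max((abs(x) for x in vec), default=0.0)
    scale = absmax / 7.0 or 1.0  # fall back to 1.0 for all-zero vectors
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up evenly
    packed = bytearray()
    for low, high in zip(codes[0::2], codes[1::2]):
        # Keep the low 4 bits of each code (two's complement nibbles).
        packed.append(((high & 0x0F) << 4) | (low & 0x0F))
    return bytes(packed), scale


def dequantize_4bit(packed, scale, dim):
    """Unpack nibbles, restore their sign, and rescale to floats."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            values.append(signed * scale)
    return values[:dim]  # drop the padding value, if any


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vector, not data from the paper.
original = [0.7, -0.3, 0.0, 0.1, -0.7]
packed, scale = quantize_4bit(original)
restored = dequantize_4bit(packed, scale, len(original))

print(f"packed size: {len(packed)} bytes for {len(original)} values")
print(f"cosine(original, restored) = {cosine_similarity(original, restored):.4f}")
```

With this scheme the rounding error on each value is at most half the scale, so vectors with a few large outliers lose more precision on their small components. Group-wise scales are one common way to limit that effect.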
Taxonomy
Topics: Photonic and Optical Devices · Sparse and Compressive Sensing Techniques
Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Adam · Softmax · Linear Warmup With Linear Decay · Residual Connection · Dropout · Byte Pair Encoding
