CQ-CiM: Hardware-Aware Embedding Shaping for Robust CiM-Based Retrieval
Xinzhao Li, Alptekin Vardar, Franz Müller, Navya Goli, Umamaheswara Rao Tida, Kai Ni, Xiaobo Sharon Hu, Thomas Kämpfe, Ruiyang Qin

TL;DR
CQ-CiM introduces a unified data shaping framework that jointly learns compression and quantization to produce low-bit, hardware-compatible embeddings, enabling more effective deployment of RAG on diverse compute-in-memory architectures.
Contribution
CQ-CiM is the first framework to jointly optimize data compression and quantization for diverse CiM architectures, improving the efficiency of RAG deployment.
Findings
Enhanced data fidelity in CiM-compatible embeddings.
Improved RAG performance on edge devices using CiM.
Unified framework supports diverse CiM implementations.
Abstract
Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand, but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structure. However, CiM's adoption for RAG is limited by a fundamental "representation gap," as high-precision, high-dimension embeddings are incompatible with CiM's low-precision, low-dimension array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique designs (e.g., 2-bit cells, 512×512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of…
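To make the "disjoint" data shaping described in the abstract concrete, here is a minimal sketch of such a baseline: reduce the dimension first, then quantize as a separate step. This is not the CQ-CiM method. The function names, the 1024-dimension input, the Gaussian random projection, and the mapping of one embedding dimension to one array column are all assumptions chosen for illustration; only the 2-bit cell and 512-wide array figures come from the abstract's examples.

```python
import numpy as np

def reduce_dim(x, out_dim, seed=0):
    """Gaussian random projection (hypothetical stand-in for any compressor)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((x.shape[1], out_dim)) / np.sqrt(out_dim)
    return x @ proj

def quantize_uniform(x, bits):
    """Uniform quantization to integer codes 0 .. 2**bits - 1 over the global range."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = np.round((x - lo) / scale).astype(np.int64)
    return np.clip(q, 0, levels), lo, scale

def dequantize(q, lo, scale):
    """Map integer codes back to approximate real values."""
    return q * scale + lo

def shape_disjoint(x, out_dim=512, bits=2):
    """Compress, then quantize, with neither step aware of the other."""
    z = reduce_dim(x, out_dim)
    q, lo, scale = quantize_uniform(z, bits)
    return q, dequantize(q, lo, scale)

# Assumed toy data: 64 embeddings of dimension 1024.
rng = np.random.default_rng(1)
emb = rng.standard_normal((64, 1024)).astype(np.float32)
codes, approx = shape_disjoint(emb, out_dim=512, bits=2)
```

Because the projection is chosen without knowledge of the quantizer, and the quantizer without knowledge of the retrieval task, the errors of the two steps accumulate. That is the fidelity loss the abstract attributes to disjoint shaping.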
