SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
Shashank Kapadia, Deep Narayan Mishra, Sujal Reddy Alugubelli, Ajay Kumar, Swapnil Yadav, Rishi Bhatia

TL;DR
SURGE is a GPU encoding system for large-scale, partitioned text data. It achieves high throughput with lower peak memory, adds fault tolerance, and is validated on real-world datasets.
Contribution
The paper introduces a cost model, a memory-safety bound, and a decision framework for resource-efficient GPU encoding of partitioned data, enabling faster and more memory-efficient processing.
Findings
Achieves 26,413 texts/sec on 10M texts with 4 GPUs.
Uses 12.6× less memory than fixed-batch methods.
Delivers 68× faster time-to-first-output, along with crash recovery.
Abstract
We present SURGE, a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs inter-process communication (IPC) calls whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) predicting throughput within 2% across three encoders spanning a 15× parameter range; (ii) a memory-safety bound (Lemma 3) enabling a streaming two-threshold policy with peak memory rather than ; and (iii) a /CV decision framework characterizing when the pattern applies beyond our workload. The naive fix of batching at fixed size requires peak memory (32.7 GB at…
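The abstract names a streaming two-threshold policy but does not define the thresholds in the text available here. The sketch below is one plausible reading, not the authors' implementation: small partitions are pooled into a "superbatch" so that a single encoder call serves many partitions, and the buffer is flushed either when it reaches a soft target or before it would exceed a hard cap. The names `stream_superbatches`, `soft_target`, `hard_cap`, and the `encode` callable are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Partition = Tuple[str, List[str]]  # (partition_id, texts); hypothetical layout


def stream_superbatches(
    partitions: Iterable[Partition],
    encode: Callable[[Sequence[str]], list],
    soft_target: int,
    hard_cap: int,
) -> Iterator[Tuple[str, list]]:
    """Pool partitions into superbatches and yield per-partition outputs.

    Assumed policy (not taken from the paper):
      * flush once the buffer holds at least `soft_target` texts;
      * flush first if adding the next partition would exceed `hard_cap`.
    Buffered texts therefore never exceed max(hard_cap, largest partition).
    """
    if not 0 < soft_target <= hard_cap:
        raise ValueError("need 0 < soft_target <= hard_cap")

    buf_texts: List[str] = []
    buf_spans: List[Tuple[str, int, int]] = []  # (partition_id, start, end)

    def flush() -> Iterator[Tuple[str, list]]:
        if not buf_spans:
            return
        # One encoder call for the whole superbatch amortizes per-call overhead.
        outputs = encode(buf_texts) if buf_texts else []
        for pid, start, end in buf_spans:
            # Scatter results back to their original partitions.
            yield pid, outputs[start:end]
        buf_texts.clear()
        buf_spans.clear()

    for pid, texts in partitions:
        if buf_texts and len(buf_texts) + len(texts) > hard_cap:
            yield from flush()
        start = len(buf_texts)
        buf_texts.extend(texts)
        buf_spans.append((pid, start, len(buf_texts)))
        if len(buf_texts) >= soft_target:
            yield from flush()

    yield from flush()  # emit whatever remains at end of stream
```

Because results are yielded as each superbatch completes, the first outputs become available after one flush rather than after the whole dataset is encoded, and partition boundaries are preserved in the output even though the encoder never sees them.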
