A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models

Sergii Kozyrev, Davyd Maiboroda (Minima AI, Inc.)

arXiv:2602.01613 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper introduces Minima, a practical pipeline that compresses large language models with tensor decompositions. On Qwen3-32B it cuts peak VRAM from 64 GiB to 40 GiB and, combined with speculative decoding, raises single-request throughput from 40 to 75 tokens per second.

Contribution

Minima is a production-ready compression pipeline that combines learned sensitivity estimation, a mixture of Tucker, tensor-train, and tensor-ring decompositions, a short healing fine-tune, and custom Triton and CUDA kernels to prepare large language models for deployment.

Findings

01

Reduces peak VRAM from 64 GiB to 40 GiB on Qwen3-32B at an 8k-token context window.

02

Increases single-request throughput from 40 tokens/sec (baseline) to 50 tokens/sec with Minima alone and to 75 tokens/sec with Minima plus speculative decoding.

03

Remains effective under high concurrency, when many requests are served at once.
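
The speculative decoding named in finding 02 can be sketched as follows. This is a minimal greedy variant under stated assumptions, not the paper's implementation: the names `speculative_decode`, `greedy_decode`, `toy_draft`, and `toy_verifier` are hypothetical, the two "models" are toy integer functions, and the verifier is called once per position for clarity, whereas a real serving stack would likely score all proposed positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to the next token id.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference decoder: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(draft: NextToken, verifier: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals verifier-only greedy decoding."""
    out = list(prompt)
    target_len = len(prompt) + max_new
    while len(out) < target_len:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The verifier checks the proposal position by position.
        for i, tok in enumerate(proposal):
            expected = verifier(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then take the verifier's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the verifier: accept them all.
            out.extend(proposal)
    return out


def toy_verifier(prefix: List[int]) -> int:
    """Toy stand-in for the large model (assumption, for illustration only)."""
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    """Toy stand-in for the small model: deliberately guesses 0 at some positions."""
    return 0 if len(prefix) % 4 == 0 else (prefix[-1] * 3 + 1) % 7


if __name__ == "__main__":
    fast = speculative_decode(toy_draft, toy_verifier, prompt=[1], max_new=10)
    slow = greedy_decode(toy_verifier, prompt=[1], max_new=10)
    print(fast == slow)
```

With greedy acceptance the generated sequence is the same as decoding with the verifier alone. Any speedup comes from the verifier confirming several draft tokens per pass, which is why memory freed by compression (room for a second, smaller model) can translate into throughput.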

Abstract

Large language models are limited in deployment by GPU memory and inference latency. We present Minima, a production compression pipeline that learns where and how to structurally compress a Transformer and turns that compression into real serving gains. Minima trains a lightweight convolutional predictor to estimate layer- and patch-level sensitivity, applies a mixture of Tucker, tensor-train, and tensor-ring decompositions to low-sensitivity regions, performs a short healing fine-tune, and executes the resulting operators with custom Triton and CUDA kernels. The reduced memory footprint enables speculative decoding with a small draft model and a larger verifier. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens…
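
To illustrate why a tensor-train factorization can shrink a weight matrix, the sketch below counts parameters for the standard TT-matrix layout, in which core i has shape (r_{i-1}, m_i, n_i, r_i) with r_0 = r_d = 1. The mode sizes and ranks are hypothetical values chosen for illustration; the abstract does not state the shapes or ranks Minima uses, and the function names `tt_matrix_params` and `dense_params` are assumptions.

```python
from math import prod
from typing import Sequence


def tt_matrix_params(row_modes: Sequence[int], col_modes: Sequence[int],
                     ranks: Sequence[int]) -> int:
    """Parameter count of a TT-matrix whose core i has shape
    (ranks[i], row_modes[i], col_modes[i], ranks[i + 1])."""
    d = len(row_modes)
    if len(col_modes) != d or len(ranks) != d + 1:
        raise ValueError("need d row modes, d column modes, and d + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must both be 1")
    return sum(ranks[i] * row_modes[i] * col_modes[i] * ranks[i + 1]
               for i in range(d))


def dense_params(row_modes: Sequence[int], col_modes: Sequence[int]) -> int:
    """Parameter count of the equivalent dense matrix."""
    return prod(row_modes) * prod(col_modes)


if __name__ == "__main__":
    # Hypothetical example: a 4096 x 4096 projection viewed as 8^4 x 8^4,
    # with all interior TT ranks set to 16.
    rows, cols, ranks = (8, 8, 8, 8), (8, 8, 8, 8), (1, 16, 16, 16, 1)
    dense = dense_params(rows, cols)
    tt = tt_matrix_params(rows, cols, ranks)
    print(f"dense={dense} tt={tt} ratio={dense / tt:.1f}x")
```

The count grows with the squares of the ranks, so rank choice governs the trade-off between compression and accuracy. That is consistent with the abstract's approach of compressing only low-sensitivity regions and then running a short healing fine-tune.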

Taxonomy

Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Generative Adversarial Networks and Image Synthesis