CPTQuant - A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models

Amitash Nanda; Sree Bhargavi Balija; Debashis Sahoo

arXiv:2412.03599·cs.CL·December 10, 2024

TL;DR

CPTQuant introduces three mixed precision post-training quantization methods, based on correlation analysis, pruning, and Taylor decomposition, that reduce the size and computational demands of large language models while maintaining accuracy.

Contribution

The paper presents CPTQuant, a comprehensive framework combining three novel mixed precision quantization strategies tailored for large language models, enhancing compression and efficiency.

Findings

01. Up to 4x model compression with minimal accuracy loss.

02. PMPQ achieves higher compression ratios than existing methods.

03. TDMPQ attains 30% greater compression for language modeling tasks.

Abstract

Large language models have transformed natural language comprehension and generation, but they come with substantial memory and computational requirements. Quantization techniques have emerged as a promising avenue for addressing these challenges while preserving accuracy and improving energy efficiency. We propose CPTQuant, a comprehensive strategy that introduces correlation-based (CMPQ), pruning-based (PMPQ), and Taylor decomposition-based (TDMPQ) mixed precision techniques. CMPQ adapts the precision level based on canonical correlation analysis of different layers. PMPQ optimizes precision layer-wise based on each layer's sensitivity to sparsity. TDMPQ modifies precision using Taylor decomposition to assess each layer's sensitivity to input perturbation. These strategies allocate higher precision to more sensitive layers and lower precision to more robust layers. CPTQuant…
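The allocation idea shared by the three strategies (higher precision for sensitive layers, lower precision for robust ones) can be illustrated with a minimal sketch. This is not the paper's algorithm: the sensitivity scores are made-up numbers standing in for whatever CMPQ, PMPQ, or TDMPQ would compute, and the function names `allocate_bits` and `average_bits`, the candidate bit widths (4, 8, 16), and the equal-sized bucketing policy are all assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def allocate_bits(sensitivity: Dict[str, float],
                  bit_choices: Sequence[int] = (4, 8, 16)) -> Dict[str, int]:
    """Map per-layer sensitivity scores to bit widths.

    Layers are ranked by score and split into equal-sized buckets, one
    bucket per candidate bit width, so the most sensitive layers get the
    widest format. Hypothetical policy, not the one used by CPTQuant.
    """
    choices = sorted(bit_choices)
    ordered = sorted(sensitivity, key=sensitivity.get)  # least sensitive first
    n = len(ordered)
    allocation = {}
    for rank, name in enumerate(ordered):
        bucket = min(rank * len(choices) // n, len(choices) - 1)
        allocation[name] = choices[bucket]
    return allocation


def average_bits(allocation: Dict[str, int],
                 num_params: Dict[str, int]) -> float:
    """Parameter-weighted average bit width of an allocation."""
    total = sum(num_params[name] for name in allocation)
    weighted = sum(allocation[name] * num_params[name] for name in allocation)
    return weighted / total


# Made-up sensitivity scores for six hypothetical layers.
scores = {"layer0": 0.9, "layer1": 0.1, "layer2": 0.5,
          "layer3": 0.3, "layer4": 0.05, "layer5": 0.7}
sizes = {name: 1_000_000 for name in scores}  # assumed equal layer sizes

alloc = allocate_bits(scores)
# layer4, layer1 -> 4 bits; layer3, layer2 -> 8 bits; layer5, layer0 -> 16 bits
avg = average_bits(alloc, sizes)
ratio_vs_fp16 = 16 / avg  # compression relative to an assumed 16-bit baseline
print(alloc, round(avg, 2), round(ratio_vs_fp16, 2))
```

In this sketch only the scoring step would differ between the three strategies; the mapping from scores to bit widths stays the same, which is one plausible way to read the abstract's description.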

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Softmax · Linear Warmup With Linear Decay · Multi-Head Attention · WordPiece · Dropout · Dense Connections · Layer Normalization · Weight Decay