
CommVQ
A commutative vector quantization method for KV cache compression that reduces FP16 cache size by 87.5% with 2-bit quantization and enables 1-bit quantization, allowing LLaMA-3.1 8B to run with a 128K context on a single RTX 4090 GPU.
About this tool
Overview
CommVQ (Commutative Vector Quantization) is a breakthrough method for KV cache compression in Large Language Models, accepted at ICML 2025. It addresses the critical memory bottleneck in long-context LLM inference.
Key Performance Results
- 87.5% memory reduction with 2-bit quantization
- 1-bit KV cache quantization with minimal accuracy loss
- 128K context length for LLaMA-3.1 8B on single RTX 4090
- Outperforms state-of-the-art KV cache quantization methods
- Maintains performance on long-context benchmarks and GSM8K
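The memory figures above can be sanity-checked with back-of-envelope arithmetic from LLaMA-3.1 8B's published architecture (32 layers, 8 grouped-query KV heads, head dimension 128). A rough sizing sketch:

```python
# KV cache sizing for LLaMA-3.1 8B (public config: 32 layers,
# 8 grouped-query KV heads, head dim 128). Illustrative arithmetic only.

def kv_cache_bytes(tokens, bits_per_value,
                   layers=32, kv_heads=8, head_dim=128):
    """Total KV cache size: keys and values across all layers."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = K and V
    return values * bits_per_value // 8

ctx = 128 * 1024  # 128K context
print(kv_cache_bytes(ctx, 16) / 2**30)  # 16.0 GiB in FP16
print(kv_cache_bytes(ctx, 2) / 2**30)   # 2.0 GiB at 2-bit (87.5% smaller)
print(kv_cache_bytes(ctx, 1) / 2**30)   # 1.0 GiB at 1-bit (93.75% smaller)
```

A 16 GiB FP16 cache alone would overflow a 24 GB RTX 4090 once model weights are loaded; at 1-2 GiB the cache fits comfortably, which is what makes the single-GPU 128K-context claim plausible.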
Technical Innovation
Additive Quantization with Lightweight Encoder
Introduces an additive quantization approach with:
- Lightweight encoder architecture
- Optimized codebook design
- Simple matrix multiplication for decoding
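The decoding-as-matmul idea above can be sketched in a few lines. This is a generic additive-quantization toy, not CommVQ's actual encoder or codebook design: each vector is approximated by a sum of codewords, one per codebook, and reconstruction is a single matrix multiply.

```python
import numpy as np

# Minimal additive-quantization sketch (illustrative; not CommVQ's actual
# encoder). Each d-dim vector is approximated by a sum of M codewords,
# one per codebook; decoding reduces to one matrix multiplication.

rng = np.random.default_rng(0)
d, M, K = 8, 4, 16                      # dim, codebooks, codewords per book
codebooks = rng.normal(size=(M, K, d))

def encode(x):
    """Greedy residual encoding: nearest codeword from each codebook."""
    codes, residual = [], x.copy()
    for m in range(M):
        idx = int(((codebooks[m] - residual) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        residual = residual - codebooks[m][idx]
    return codes

def decode(codes):
    """Decoding = (sparse one-hot codes) @ (stacked codebooks): a matmul."""
    onehot = np.zeros((M, K))
    onehot[np.arange(M), codes] = 1.0
    return onehot.reshape(-1) @ codebooks.reshape(M * K, d)

x = rng.normal(size=d)
x_hat = decode(encode(x))
print(x_hat.shape)  # (8,)
```

Only the integer codes are stored in the cache; the matmul-style decode is what keeps dequantization cheap at inference time.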
Commutative with RoPE
Key innovation: the codebook is designed to commute with Rotary Position Embedding (RoPE):
- Allows position encoding to be handled efficiently in the compressed cache
- Preserves positional information after compression
- Critical for long-context understanding
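A small sketch of why commuting with RoPE matters. This illustrates the underlying linear-algebra property rather than the paper's exact construction: RoPE is a block-diagonal rotation, and because decoding is linear, the rotation can be folded into the codebook instead of being applied to each dequantized vector.

```python
import numpy as np

# Illustrative only (not CommVQ's exact codebook construction).
# RoPE applies a block-diagonal 2x2 rotation R(theta_p) at position p.
# Since decoding is a matmul with the codebook C, the rotation folds in:
# R @ (C @ code) == (R @ C) @ code.

def rope_matrix(pos, d, base=10000.0):
    """Block-diagonal RoPE rotation matrix for one position."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(1)
d, K = 8, 16
C = rng.normal(size=(d, K))          # codebook: codewords as columns
code = np.zeros(K); code[3] = 1.0    # one-hot selection of codeword 3

R = rope_matrix(pos=7, d=d)
rotate_then = R @ (C @ code)         # decode, then apply RoPE
fold_into_codebook = (R @ C) @ code  # decode with pre-rotated codebook
print(np.allclose(rotate_then, fold_into_codebook))  # True
```

The practical consequence: position-dependent rotation never forces a full dequantize-rotate-requantize round trip on cached keys.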
Expectation-Maximization Training
The codebook is trained using an Expectation-Maximization (EM) algorithm:
- Iterative optimization of code assignments and codeword values
- Convergence to a locally optimal codebook
- No model fine-tuning required
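The alternating optimization above can be illustrated with the simplest EM-style codebook trainer, plain k-means (Lloyd's algorithm): the E-step assigns each vector to its nearest codeword, the M-step moves each codeword to the mean of its assignments. This is an analogue of the procedure described, not CommVQ's exact algorithm.

```python
import numpy as np

# Simplified EM-style codebook training (plain k-means / Lloyd's
# algorithm) as an analogue of iterative codebook optimization.
# Not CommVQ's exact training procedure.

def train_codebook(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # E-step: assign each vector to its nearest codeword
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # M-step: move each codeword to the mean of its assigned vectors
        for k in range(K):
            if (assign == k).any():
                codebook[k] = X[assign == k].mean(axis=0)
    return codebook, assign

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))          # stand-in for calibration KV vectors
codebook, assign = train_codebook(X, K=16)
err = ((X - codebook[assign]) ** 2).sum(-1).mean()
print(codebook.shape)  # (16, 8)
```

Because only the codebook is fit to calibration data, the base model's weights are untouched, which is what "no model fine-tuning required" refers to.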
Practical Impact
Enable Long-Context Inference on Consumer GPUs
Allows LLaMA-3.1 8B to run with a 128K context on a single RTX 4090, a consumer GPU, which means:
- Dramatically lower hardware requirements
- Accessible long-context LLM inference
Extreme Compression Ratios
- 2-bit quantization: 87.5% memory savings
- 1-bit quantization: 93.75% memory savings
- Minimal accuracy degradation
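The percentages above are simply the bit width relative to the FP16 baseline; a one-line check (helper name is hypothetical):

```python
# Memory savings relative to a 16-bit (FP16) baseline.
def savings_vs_fp16(bits):
    return 1.0 - bits / 16.0

print(savings_vs_fp16(2))  # 0.875
print(savings_vs_fp16(1))  # 0.9375
```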
Implementation
- Open Source: Code available on GitHub
- Repository: https://github.com/UMass-Embodied-AGI/CommVQ
- Integration: Compatible with popular LLM frameworks
- Easy Adoption: Drop-in replacement for standard KV cache
Benchmarks
Long-Context Performance
Tested on long-context benchmarks showing:
- Maintained accuracy across various tasks
- Consistent performance at different context lengths
- Better than existing quantization methods
GSM8K Mathematical Reasoning
- Preserved reasoning capabilities
- Minimal degradation on complex tasks
- Competitive with uncompressed models
Use Cases
Long Document Processing
- Legal document analysis
- Scientific paper comprehension
- Book-length text understanding
- Multi-document reasoning
Conversational AI
- Extended conversation history
- Long-term context retention
- Multi-turn dialogue systems
- Context-aware responses
Edge Deployment
- On-device LLM inference
- Mobile and IoT applications
- Low-power AI systems
- Privacy-preserving local inference
Cost Optimization
- Reduce cloud infrastructure costs
- Lower memory bandwidth requirements
- Improve serving throughput
- More efficient batch processing
Authors and Affiliation
Developed by researchers from:
- UMass Amherst (Embodied AGI Lab)
- Apple Machine Learning Research
- MIT
- University of Toronto
Authors: Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, and Chuang Gan
Publication Status
- Accepted: ICML 2025
- Released: 2025
- Availability: Paper and code publicly available
Impact on LLM Research
CommVQ represents a significant advancement in:
- Memory-efficient LLM inference
- Long-context language modeling
- Practical deployment of large models
- Democratization of LLM technology
Comparison with Other Methods
vs. Scalar Quantization
- Better accuracy at the same bit width
- Maintains accuracy at extreme compression (1-2 bits)
- Vector quantization exploits correlations across dimensions that per-value scalar quantization cannot
vs. VQKV
- Specifically optimized for RoPE
- Better performance on positional tasks
- More efficient for transformer architectures
vs. Token Pruning
- Keeps every token (no cache entries are discarded, unlike pruning)
- Reconstruction is approximate but covers the full context uniformly
- Better for retrieval tasks that may need any past token
Future Directions
- Integration with other efficiency techniques
- Hardware-specific optimizations
- Extension to other model architectures
- Multi-modal model compression