
    CommVQ

    A commutative vector quantization method for KV cache compression that reduces the FP16 KV cache size by 87.5% with 2-bit quantization and enables 1-bit quantization, allowing LLaMA-3.1 8B to run with a 128K context on a single RTX 4090 GPU.



    Overview

    CommVQ (Commutative Vector Quantization) is a breakthrough method for KV cache compression in Large Language Models, accepted at ICML 2025. It addresses the critical memory bottleneck in long-context LLM inference.

    Key Performance Results

    • 87.5% memory reduction with 2-bit quantization
    • 1-bit KV cache quantization with minimal accuracy loss
    • 128K context length for LLaMA-3.1 8B on single RTX 4090
    • Outperforms state-of-the-art KV cache quantization methods
    • Maintains performance on long-context benchmarks and GSM8K
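
    The compression figures above follow directly from the bit widths relative to the FP16 baseline; a minimal arithmetic check (ignoring the small codebook and encoder overhead):

```python
# Relative KV cache size of a b-bit representation versus the 16-bit (FP16) baseline.
# Codebook/encoder overhead is ignored in this rough check.
def reduction_vs_fp16(bits: int) -> float:
    return 1.0 - bits / 16.0

print(f"2-bit: {reduction_vs_fp16(2):.1%} reduction")   # 87.5% reduction
print(f"1-bit: {reduction_vs_fp16(1):.2%} reduction")   # 93.75% reduction
```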

    Technical Innovation

    Additive Quantization with Lightweight Encoder

    CommVQ introduces an additive quantization approach (sketched after this list) with:

    • Lightweight encoder architecture
    • Optimized codebook design
    • Simple matrix multiplication for decoding
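
    A minimal sketch of additive (multi-codebook) quantization with matrix-multiplication decoding. The dimensions, number of codebooks, and the greedy residual encoder below are illustrative assumptions rather than the paper's exact configuration (CommVQ uses a learned lightweight encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 128          # head dimension (assumed)
num_books = 4    # number of additive codebooks (assumed)
book_size = 16   # 16 entries per codebook -> 4-bit index each (assumed)

# Each codebook holds `book_size` vectors of dimension d; a key/value vector is
# approximated by the SUM of one selected entry from each codebook.
codebooks = rng.normal(size=(num_books, book_size, d)).astype(np.float32)

def encode(x):
    """Greedy residual encoding: a stand-in for the paper's learned lightweight encoder."""
    codes = np.empty(num_books, dtype=np.int64)
    residual = x.copy()
    for m in range(num_books):
        dists = np.linalg.norm(codebooks[m] - residual, axis=1)
        codes[m] = dists.argmin()
        residual = residual - codebooks[m, codes[m]]
    return codes

def decode(codes):
    """Decoding reduces to a simple matrix multiplication with one-hot code vectors."""
    one_hot = np.zeros((num_books, book_size), dtype=np.float32)
    one_hot[np.arange(num_books), codes] = 1.0
    return np.einsum("mb,mbd->d", one_hot, codebooks)  # sum of the selected codewords

x = rng.normal(size=d).astype(np.float32)
x_hat = decode(encode(x))
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```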

    Commutative with RoPE

    The key innovation is that the codebook is designed to commute with Rotary Position Embedding (RoPE), as illustrated in the sketch after this list:

    • Enables efficient position encoding
    • Maintains positional information after compression
    • Critical for long-context understanding
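
    A toy illustration of the commutativity idea, assuming RoPE is viewed as a block-diagonal matrix of 2x2 rotations and the codebook's decoding operator is built from scaled 2x2 rotations on the same pairs (scaled rotations commute with rotations). This is a conceptual sketch, not the paper's construction:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def block_diag_rotations(angles, scales=None):
    """Block-diagonal matrix of (optionally scaled) 2x2 rotations, RoPE-style."""
    scales = np.ones(len(angles)) if scales is None else scales
    M = np.zeros((2 * len(angles), 2 * len(angles)))
    for i, (a, s) in enumerate(zip(angles, scales)):
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = s * rot(a)
    return M

rng = np.random.default_rng(0)
pairs = 4  # an 8-dimensional toy head (assumed)

# RoPE at some position: one rotation angle per 2-D pair.
rope = block_diag_rotations(angles=3.0 * rng.uniform(size=pairs))

# A codebook decoding operator built from scaled rotations on the same 2-D pairs.
codebook_op = block_diag_rotations(angles=rng.uniform(size=pairs),
                                   scales=rng.uniform(0.5, 2.0, size=pairs))

# Because scaled 2-D rotations commute with rotations, the order does not matter:
print(np.allclose(rope @ codebook_op, codebook_op @ rope))  # True
```

    In practice, commutativity means the cached codes can be decoded and then position-rotated (or the rotation folded in beforehand) without changing the result, which is why positional information survives compression.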

    Expectation-Maximization Training

    The codebook is trained with an Expectation-Maximization (EM) algorithm (a minimal sketch follows this list):

    • Iterative optimization
    • Convergence to optimal quantization
    • No model fine-tuning required
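
    For a single codebook, the EM alternation reduces to a k-means-style loop over cached vectors: assign codes in the E-step, refit codewords in the M-step. The sketch below is that simplified version with assumed toy dimensions; only the codebook is learned, so the base model is never fine-tuned:

```python
import numpy as np

def train_codebook_em(X, book_size, iters=20, seed=0):
    """EM-style codebook training on cached vectors X (k-means-like alternation)."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=book_size, replace=False)].copy()
    for _ in range(iters):
        # E-step: assign each vector to its nearest codeword.
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # M-step: re-estimate each codeword as the mean of its assigned vectors.
        for k in range(book_size):
            members = X[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

# Toy run on random stand-ins for cached key/value vectors (purely illustrative).
X = np.random.default_rng(1).normal(size=(2048, 64)).astype(np.float32)
codebook = train_codebook_em(X, book_size=128)
print(codebook.shape)  # (128, 64)
```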

    Practical Impact

    Enable Long-Context on Consumer GPUs

    Allows LLaMA-3.1 8B with a 128K context to run on a single RTX 4090, a consumer GPU (see the rough memory estimate after this list). This means:

    • Dramatically lower hardware requirements
    • Accessible long-context LLM inference
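
    A back-of-the-envelope estimate of why this fits, assuming LLaMA-3.1 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 24 GB RTX 4090; exact numbers depend on the serving stack and on quantization overhead such as codebooks and scales:

```python
# Rough KV cache size for LLaMA-3.1 8B at a 128K context (assumed configuration).
layers, kv_heads, head_dim = 32, 8, 128
context = 128 * 1024
elements = 2 * layers * kv_heads * head_dim * context  # factor 2 = keys + values

for name, bits in [("FP16", 16), ("2-bit", 2), ("1-bit", 1)]:
    gib = elements * bits / 8 / 2**30
    print(f"{name:>5}: {gib:6.2f} GiB KV cache")

# FP16 : 16.00 GiB -> does not fit next to ~16 GB of FP16 weights in 24 GB
# 2-bit:  2.00 GiB -> leaves room for the weights on a single RTX 4090
# 1-bit:  1.00 GiB
```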

    Extreme Compression Ratios

    • 2-bit quantization: 87.5% memory savings
    • 1-bit quantization: 93.75% memory savings
    • Minimal accuracy degradation

    Implementation

    • Open Source: Code available on GitHub
    • Repository: https://github.com/UMass-Embodied-AGI/CommVQ
    • Integration: Compatible with popular LLM frameworks
    • Easy Adoption: Drop-in replacement for standard KV cache

    Benchmarks

    Long-Context Performance

    Evaluations on long-context benchmarks show:

    • Maintained accuracy across various tasks
    • Consistent performance at different context lengths
    • Better than existing quantization methods

    GSM8K Mathematical Reasoning

    • Preserved reasoning capabilities
    • Minimal degradation on complex tasks
    • Competitive with uncompressed models

    Use Cases

    Long Document Processing

    • Legal document analysis
    • Scientific paper comprehension
    • Book-length text understanding
    • Multi-document reasoning

    Conversational AI

    • Extended conversation history
    • Long-term context retention
    • Multi-turn dialogue systems
    • Context-aware responses

    Edge Deployment

    • On-device LLM inference
    • Mobile and IoT applications
    • Low-power AI systems
    • Privacy-preserving local inference

    Cost Optimization

    • Reduce cloud infrastructure costs
    • Lower memory bandwidth requirements
    • Improve serving throughput
    • More efficient batch processing

    Authors and Affiliation

    Developed by researchers from:

    • UMass Amherst (Embodied AGI Lab)
    • Apple Machine Learning Research
    • MIT
    • University of Toronto

    Authors: Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, and Chuang Gan

    Publication Status

    • Accepted: ICML 2025
    • Released: 2025-2026
    • Availability: Paper and code publicly available

    Impact on LLM Research

    CommVQ represents a significant advancement in:

    • Memory-efficient LLM inference
    • Long-context language modeling
    • Practical deployment of large models
    • Democratization of LLM technology

    Comparison with Other Methods

    vs. Scalar Quantization

    • Better compression at same bit width
    • Maintained accuracy at extreme compression
    • More sophisticated quantization strategy
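
    A hedged toy comparison of why vector quantization can beat scalar quantization at the same bit budget: with 2 bits per element, a scalar quantizer gets 4 uniform levels per coordinate, while a vector quantizer can spend the same budget (here, 16 codewords for a 2-D vector) on codewords that follow the correlation structure of the data. The data and dimensions below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly correlated 2-D data: the interesting case for vector quantization.
n = 5000
x = rng.normal(size=n)
X = np.stack([x, 0.95 * x + 0.05 * rng.normal(size=n)], axis=1)

bits_per_element = 2

# Scalar quantization: 2**bits uniform levels per coordinate, independently.
levels = 2 ** bits_per_element
lo, hi = X.min(axis=0), X.max(axis=0)
step = (hi - lo) / (levels - 1)
X_sq = np.round((X - lo) / step) * step + lo

# Vector quantization: the same budget buys 2**(bits * dim) whole-vector codewords.
k = 2 ** (bits_per_element * X.shape[1])  # 16 codewords
codebook = X[rng.choice(n, size=k, replace=False)].copy()
for _ in range(25):  # plain k-means to place the codewords
    assign = ((X[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
    for j in range(k):
        if (assign == j).any():
            codebook[j] = X[assign == j].mean(0)
assign = ((X[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
X_vq = codebook[assign]

print("scalar quantization MSE:", np.mean((X - X_sq) ** 2))
print("vector quantization MSE:", np.mean((X - X_vq) ** 2))  # much lower on correlated data
```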

    vs. VQKV

    • Specifically optimized for RoPE
    • Better performance on positional tasks
    • More efficient for transformer architectures

    vs. Token Pruning

    • Every token is retained (no cache entries are discarded, unlike pruning)
    • Compresses values instead of deleting positions, so all context stays addressable
    • Better for retrieval tasks

    Future Directions

    • Integration with other efficiency techniques
    • Hardware-specific optimizations
    • Extension to other model architectures
    • Multi-modal model compression

    Information

    • Website: machinelearning.apple.com
    • Published: Mar 20, 2026
    • Categories: Research Papers & Surveys
    • Tags: #Compression, #Quantization, #LLM Optimization

    Similar Products

    VQKV

    A training-free vector quantization method for KV cache compression in Large Language Models that achieves 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% baseline performance and enabling 4.3x longer generation length on the same memory footprint.

    Leech Lattice Vector Quantization

    Advanced vector quantization technique that explores the Leech lattice's optimal sphere-packing properties in 24 dimensions. Delivers state-of-the-art LLM quantization performance, outperforming recent methods like QuIP#, QTIP, and PVQ for extreme vector compression.

    Statistical Binary Quantization

    Compression method developed by Timescale researchers that improves on standard Binary Quantization, reducing vector memory footprint by 32x while maintaining high accuracy for filtered searches.

    BBQ Binary Quantization

    Elasticsearch and Lucene's implementation of RaBitQ algorithm for 1-bit vector quantization, renamed as BBQ. Provides 32x compression with asymptotically optimal error bounds, enabling efficient vector search at massive scale with minimal accuracy loss.

    Locally-Adaptive Vector Quantization

    Advanced quantization technique that applies per-vector normalization and scalar quantization, adapting the quantization bounds individually for each vector. Achieves four-fold reduction in vector size while maintaining search accuracy with 26-37% overall memory footprint reduction.

    Anisotropic Vector Quantization

    An advanced quantization technique introduced by Google's ScaNN that prioritizes preserving parallel components between vectors rather than minimizing overall distance. Optimized for Maximum Inner Product Search (MIPS) and significantly improves retrieval accuracy.
