



Post-training quantization method that compresses LLM weights to 4-bit precision, targeting GPU inference performance. The first quantization method to compress LLMs to the 4-bit range while maintaining accuracy, by minimizing the squared error of each layer's output after quantization.
GPTQ (Generative Pre-trained Transformer Quantization) is a pioneering post-training quantization method that compresses large language models to 4-bit precision while maintaining accuracy. It was the first method to successfully compress LLMs to the 4-bit range. It works one layer at a time: using a small calibration set, it quantizes the weights column by column and applies second-order (Hessian-based) updates to the remaining weights to compensate for the error introduced at each step.
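Concretely, the per-layer problem GPTQ approximately solves can be sketched as follows, where W is a layer's full-precision weight matrix and X is the matrix of calibration inputs to that layer:

```latex
% Per-layer objective: find 4-bit weights \hat{W} whose outputs
% best match the full-precision layer on calibration inputs X.
\hat{W} = \operatorname*{arg\,min}_{\hat{W}}
          \left\lVert W X - \hat{W} X \right\rVert_2^2
\quad \text{s.t. each entry of } \hat{W} \text{ lies on a 4-bit quantization grid}
```

Note that the error is measured on the layer's outputs, not on the weights themselves, which is what lets aggressive 4-bit rounding stay accurate.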
Provides significant speedup for GPU inference while keeping model quality close to the original precision: 4-bit weights occupy roughly a quarter of the memory of FP16 weights, which cuts memory traffic during generation and lets larger models fit on a single GPU.
Supported by Hugging Face Transformers, vLLM, and other LLM serving frameworks. Models quantized with GPTQ can be loaded and deployed in a few lines of code, as sketched below.
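A minimal loading sketch with Transformers, assuming the optimum and auto-gptq packages are installed; the repository name is just an illustrative pre-quantized checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative pre-quantized GPTQ checkpoint on the Hugging Face Hub.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Transformers detects the GPTQ config stored in the repo and loads
# the 4-bit weights; device_map="auto" places them on available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```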
Free and open-source method. Many pre-quantized GPTQ checkpoints are available on the Hugging Face Hub.
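For models without a pre-quantized checkpoint, the Transformers integration also exposes GPTQ quantization itself via GPTQConfig. A rough sketch, assuming optimum and auto-gptq are installed and a GPU is available for calibration; the model name and calibration dataset are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with group-wise quantization; "c4" is a built-in
# calibration dataset option in the Transformers GPTQ integration.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading: calibration data is pushed through
# the model layer by layer and the 4-bit weights are solved for.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

model.save_pretrained("opt-125m-gptq")  # reloadable like any HF checkpoint
tokenizer.save_pretrained("opt-125m-gptq")
```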