



GPT-Generated Unified Format for storing quantized model weights, designed for CPU inference and consumer hardware. Enables running LLMs on laptops and edge devices with flexible layer offloading to GPU.
GGUF (GPT-Generated Unified Format) is not a quantization technique itself, but a file format for storing quantized models, optimized for CPU inference. It is the successor to the GGML format and enables running large language models on consumer hardware.
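Because GGUF is a file format rather than an algorithm, its core is a fixed binary layout. A minimal sketch of parsing the header fields, based on the published GGUF specification (the header bytes below are synthetic, built in-place for illustration; a real file comes from llama.cpp's conversion tools):

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    # A GGUF file begins with a fixed little-endian header:
    #   4-byte magic "GGUF", uint32 format version,
    #   uint64 tensor count, uint64 metadata key/value count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header: version 3, 291 tensors, 19 metadata key/value pairs.
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 19)
print(parse_gguf_header(fake_header))
```

The metadata key/value section that follows the header is what makes the format "unified": architecture, tokenizer, and quantization details travel inside the file, so loaders need no side-car config.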
At the Q4_K_M quantization level, GGUF models achieve a perplexity of 6.74, close to the full-precision baseline of 6.56, while enabling deployment on consumer hardware. It is the best choice for CPU deployment and hardware flexibility.
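To put those figures in perspective, the quality cost works out to under a 3% relative perplexity increase (using the two numbers quoted above):

```python
baseline_ppl = 6.56  # full-precision reference perplexity (from the text)
q4_k_m_ppl = 6.74    # GGUF Q4_K_M perplexity (from the text)

# Relative degradation: a small quality cost for roughly 4x smaller weights.
increase_pct = (q4_k_m_ppl - baseline_ppl) / baseline_ppl * 100
print(f"{increase_pct:.1f}% perplexity increase")
```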
GGUF is the primary format for llama.cpp, Ollama, and LM Studio, and is supported by many other local LLM tools.
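A typical llama.cpp workflow, sketched below, converts a Hugging Face checkpoint to GGUF, quantizes it to Q4_K_M, and runs it with partial GPU offload. The model paths are placeholders, and binary names have varied across llama.cpp releases (older builds used `./quantize` and `./main`):

```shell
# Convert a Hugging Face model directory to an FP16 GGUF file.
python convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf

# Quantize to the Q4_K_M level discussed above.
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M

# Run inference, offloading 20 layers to the GPU (-ngl); the rest stay on CPU.
./llama-cli -m my-model-Q4_K_M.gguf -ngl 20 -p "Hello"
```

The `-ngl` flag is what gives GGUF its flexible CPU/GPU split: any number of layers from zero to all can be offloaded, so the same file serves laptops and GPU workstations.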
GGUF is a free, open-source format, and many pre-quantized GGUF models are available on Hugging Face.