

High-throughput and memory-efficient open-source LLM inference engine with PagedAttention, continuous batching, and support for embedding model serving. Widely adopted for production-scale AI inference.
vLLM is an open-source inference engine optimized for large language models. It implements PagedAttention to manage the KV cache efficiently and continuous batching to maximize GPU throughput. Though designed primarily for LLM inference, vLLM also supports serving embedding models.
Free and open-source under the Apache 2.0 license.
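A minimal usage sketch of the serving workflow described above, using vLLM's documented `vllm serve` command and its OpenAI-compatible API. The model name is only an example; any supported Hugging Face model can be substituted, and a CUDA-capable GPU is assumed.

```shell
# Install vLLM and launch an OpenAI-compatible server (example model; requires a GPU).
pip install vllm
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Query the running server through the standard OpenAI chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

Because the server speaks the OpenAI API, existing OpenAI client libraries can point at it by changing only the base URL.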