



High-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP and ColPali with GPU acceleration support for local deployment and production use.
Infinity is a high-throughput, low-latency REST API serving engine for deploying text-embedding models, reranking models, CLIP, CLAP, and ColPali in production environments.
Use --device-id 0,1,2,3 for an approximately 4x throughput increase.
docker run -it --gpus all \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest \
v2 \
--model-id $model \
--port $port
Docker images: michaelf34/infinity:latest, michaelf34/infinity:latest-cpu
pip install infinity-emb
infinity_emb v2 --model-id <model> --port <port>
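Once the server is running, clients talk to it over HTTP. The sketch below is a minimal standard-library client, assuming the server exposes an OpenAI-style POST /embeddings route that accepts a JSON body with "model" and "input" and returns a "data" list of objects holding "embedding" and "index". The route, the payload shape, and the helper names are assumptions not stated above, so check {url}:{port}/docs on your own deployment for the actual schema.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST request for the assumed OpenAI-style /embeddings route."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",  # assumed route, verify in /docs
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(body: dict) -> list[list[float]]:
    """Extract the vectors from the assumed response shape, ordered by index."""
    items = sorted(body["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]


def embed(base_url: str, model: str, texts: list[str], timeout: float = 30.0) -> list[list[float]]:
    """Send the texts to a running server and return one vector per text."""
    request = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```

With a server started by either command above, a call would look like embed("http://localhost:<port>", "<model>", ["Paris is in France."]), using the same model and port values passed at startup.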
Use AsyncEmbeddingEngine for programmatic access with maximum flexibility
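As an illustration of the programmatic route, the sketch below embeds a few sentences in-process with no HTTP server. It assumes the infinity_emb package exports AsyncEmbeddingEngine and EngineArgs, that the engine can be built with from_args and used as an async context manager, and that embed returns an (embeddings, usage) pair. These signatures may differ between package versions, so treat them as assumptions. The helper names and the cosine-similarity step are illustrative additions, not part of the library.

```python
import asyncio
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return float(dot / (norm_a * norm_b))


async def embed_locally(model_id: str, sentences: list[str]):
    """Embed sentences in-process; returns one vector per sentence."""
    # Imported lazily because the package is heavy and optional here.
    # Assumed API: AsyncEmbeddingEngine.from_args(EngineArgs(...)).
    from infinity_emb import AsyncEmbeddingEngine, EngineArgs

    engine = AsyncEmbeddingEngine.from_args(EngineArgs(model_name_or_path=model_id))
    # The context manager starts the engine on entry and stops it on exit.
    async with engine:
        embeddings, _usage = await engine.embed(sentences=sentences)
    return embeddings


async def compare(model_id: str, first: str, second: str) -> float:
    """Embed two sentences and return their cosine similarity."""
    vec_a, vec_b = await embed_locally(model_id, [first, second])
    return cosine_similarity(vec_a, vec_b)
```

Running asyncio.run(compare("<model>", "Paris is in France.", "France's capital is Paris.")) would load the model once and return a single similarity score. Keeping the engine inside one async context avoids repeated model start-up.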
API docs are served at {url}:{port}/docs for testing.
Free and open-source, available on GitHub.