
    Infinity

    High-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali, with GPU acceleration, for local deployment and production use.


    About this tool

    Overview

    Infinity is a high-throughput, low-latency REST API serving engine designed for deploying text embedding, reranking, CLIP, CLAP, and ColPali models into production environments.

    Key Features

    GPU Acceleration

    • Built on top of torch, optimum (ONNX/TensorRT), and CTranslate2
    • Runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS accelerators, using FlashAttention for optimal performance where available
    • Multi-GPU support via --device-id 0,1,2,3, for an approximately 4x throughput increase
    • Dynamic batching, with tokenization performed in dedicated worker threads

    Docker Deployment

    # Example values; adjust to your environment
    port=7997
    model=BAAI/bge-small-en-v1.5
    volume=$PWD/.cache

    docker run -it --gpus all \
      -v $volume:/app/.cache \
      -p $port:$port \
      michaelf34/infinity:latest \
      v2 \
      --model-id $model \
      --port $port


    Performance Optimization

    • Dynamic batching for improved throughput
    • Low-latency response times for production workloads
    • Efficient tokenization in worker threads
    • Support for both CPU and GPU deployments
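The dynamic-batching idea above can be illustrated with a small, generic sketch. This is not Infinity's implementation, just the pattern it names: requests accumulate in a queue, and a worker drains up to `max_batch_size` items at a time so that concurrent callers share a single forward pass.

```python
import queue
import threading

def make_dynamic_batcher(process_batch, max_batch_size=32):
    """Return (submit, stop). submit(item) blocks until the worker has
    processed the item's batch, then returns that item's result.
    process_batch maps a list of items to a list of results."""
    q = queue.Queue()

    def worker():
        while True:
            first = q.get()
            if first is None:          # stop sentinel
                return
            batch = [first]
            # Drain whatever else is already queued, up to the batch limit.
            while len(batch) < max_batch_size:
                try:
                    nxt = q.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    q.put(None)        # re-post sentinel for clean shutdown
                    break
                batch.append(nxt)
            results = process_batch([item for item, _ in batch])
            for (_, slot), res in zip(batch, results):
                slot[0] = res          # hand the result back to the caller
                slot[1].set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    def submit(item):
        slot = [None, threading.Event()]
        q.put((item, slot))
        slot[1].wait()
        return slot[0]

    def stop():
        q.put(None)
        t.join()

    return submit, stop
```

For example, `submit, stop = make_dynamic_batcher(lambda xs: [x * 2 for x in xs])` gives a `submit(3)` that returns `6`; when many threads call `submit` concurrently, their inputs are processed in shared batches.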

    Supported Models

    • Text Embeddings: SentenceTransformers and compatible models
    • Reranking Models: Cross-encoder models for result reranking
    • CLIP: Contrastive language-image pretraining models
    • CLAP: Contrastive language-audio pretraining models
    • ColPali: Multi-vector retrieval models

    Installation & Deployment

    Docker (Recommended)

    • GPU: michaelf34/infinity:latest
    • CPU: michaelf34/infinity:latest-cpu

    CLI Installation

    pip install infinity-emb
    infinity_emb v2 --model-id <model> --port <port>
    

    Python API

    Use AsyncEmbeddingEngine for programmatic access with maximum flexibility.
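A minimal sketch of programmatic use follows. The names (`AsyncEngineArray`, `EngineArgs`, `engine.embed`) follow the infinity-emb README, but treat the exact arguments as assumptions to verify against the version you install; the heavy import is kept inside the function so the sketch reads (and imports) without the package present.

```python
import asyncio

def embed_locally(sentences, model="BAAI/bge-small-en-v1.5"):
    """Embed sentences in-process with infinity-emb
    (`pip install infinity-emb[all]`). API names follow the project
    README; confirm them against the installed version."""
    from infinity_emb import AsyncEngineArray, EngineArgs  # heavy import kept local

    array = AsyncEngineArray.from_args([
        EngineArgs(model_name_or_path=model, engine="torch")
    ])

    async def run():
        engine = array[0]
        async with engine:  # starts and stops the engine around the call
            embeddings, usage = await engine.embed(sentences=sentences)
        return embeddings

    return asyncio.run(run())
```

Calling `embed_locally(["Paris is in France."])` should return one embedding vector per input sentence, at the cost of loading the model into the calling process.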

    API Compatibility

    • OpenAI-compatible API specifications
    • Swagger UI available at {url}:{port}/docs for testing
    • RESTful endpoints for easy integration
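Because the endpoints are OpenAI-compatible, any plain HTTP client works. The helper below builds an OpenAI-style request for an `/embeddings` route using only the standard library; the base URL, model name, and exact route are placeholders for your deployment (confirm the route in the Swagger UI at `{url}:{port}/docs`).

```python
import json
import urllib.request

def build_embedding_request(base_url, model, inputs):
    """Build an OpenAI-style POST request for the /embeddings endpoint.
    Returns a urllib.request.Request ready to pass to urlopen()."""
    payload = {"model": model, "input": inputs}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(base_url, model, inputs):
    """Send the request and return the vectors, assuming the OpenAI
    response shape: {"data": [{"embedding": [...]}, ...]}."""
    req = build_embedding_request(base_url, model, inputs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

With a server started as in the Docker example, `embed("http://localhost:7997", "BAAI/bge-small-en-v1.5", ["hello"])` would return a single embedding vector.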

    Use Cases

    • Production embedding services for RAG applications
    • Real-time semantic search systems
    • Multi-modal search with CLIP/CLAP models
    • Reranking services for improved search relevance
    • Document retrieval with ColPali

    Pricing

    Free and open-source, available on GitHub.


    Information

    Website: github.com
    Published: Mar 18, 2026

    Categories

    SDKs & Libraries

    Tags

    #Embeddings #GPU Acceleration #Open Source

    Similar Products

    NVIDIA cuVS

    GPU-accelerated vector search and clustering library from NVIDIA RAPIDS. Provides 8-12x faster index building and queries, with bindings for multiple languages (C, C++, Python, Rust). It is an open-source library.

    Sentence Transformers v3.0

    Major update to the Sentence Transformers library introducing a new SentenceTransformerTrainer for easier fine-tuning, multi-GPU support, improved loss logging, and access to 15,000+ pre-trained models on HuggingFace.

    FlagEmbedding

    Open-source retrieval and RAG framework from BAAI featuring the BGE embedding model series. BGE-M3 supports multi-functionality (dense, sparse, multi-vector), multi-linguality (100+ languages), and multi-granularity (up to 8192 tokens).

    Qwen3 Embedding

    Multilingual embedding model supporting over 100 languages and ranking #1 on MTEB multilingual leaderboard. Offers flexible model sizes from 0.6B to 8B parameters with user-defined instructions.

    BGE-M3

    A versatile multilingual text embedding model from BAAI that supports 100+ languages and can handle inputs up to 8192 tokens. BGE-M3 is unique in supporting three retrieval methods simultaneously: dense retrieval, multi-vector retrieval, and sparse retrieval.

    gte-Qwen2-1.5B-instruct

    A state-of-the-art multilingual text embedding model from Alibaba's GTE (General Text Embedding) series, built on the Qwen2-1.5B LLM. The model supports up to 8192 tokens and incorporates bidirectional attention mechanisms for enhanced contextual understanding across diverse domains.

    Copyright © 2025 Awesome Vector Databases. All rights reserved.