
    Infinity

    High-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali, with GPU acceleration, for local deployment and production use.


    About this tool

    Overview

    Infinity is a high-throughput, low-latency REST API serving engine designed for deploying text embedding, reranking, CLIP, CLAP, and ColPali models into production environments.

    Key Features

    GPU Acceleration

    • Built on top of torch, optimum (ONNX/TensorRT), and CTranslate2
    • Runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS accelerators, using FlashAttention for optimal performance where available
    • Multi-GPU support via --device-id 0,1,2,3, for an approximately 4x throughput increase
    • Dynamic batching, with tokenization performed in dedicated worker threads

    Docker Deployment

    # Example values; adjust to your environment
    port=7997
    model=BAAI/bge-small-en-v1.5
    volume=$PWD/.cache

    docker run -it --gpus all \
      -v $volume:/app/.cache \
      -p $port:$port \
      michaelf34/infinity:latest \
      v2 \
      --model-id $model \
      --port $port


    Performance Optimization

    • Dynamic batching for improved throughput
    • Low-latency response times for production workloads
    • Efficient tokenization in worker threads
    • Support for both CPU and GPU deployments
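The dynamic-batching idea above can be illustrated with a small, generic sketch. This is not Infinity's implementation, just the pattern it names: requests accumulate in a queue, and a worker drains up to `max_batch_size` items at a time so that concurrent callers share a single forward pass.

```python
import queue
import threading

def make_dynamic_batcher(process_batch, max_batch_size=32):
    """Return (submit, stop). submit(item) blocks until the worker has
    processed the item's batch, then returns that item's result.
    process_batch maps a list of items to a list of results."""
    q = queue.Queue()

    def worker():
        while True:
            first = q.get()
            if first is None:          # stop sentinel
                return
            batch = [first]
            # Drain whatever else is already queued, up to the batch limit.
            while len(batch) < max_batch_size:
                try:
                    nxt = q.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    q.put(None)        # re-post sentinel for clean shutdown
                    break
                batch.append(nxt)
            results = process_batch([item for item, _ in batch])
            for (_, slot), res in zip(batch, results):
                slot[0] = res          # hand the result back to the caller
                slot[1].set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    def submit(item):
        slot = [None, threading.Event()]
        q.put((item, slot))
        slot[1].wait()
        return slot[0]

    def stop():
        q.put(None)
        t.join()

    return submit, stop
```

For example, `submit, stop = make_dynamic_batcher(lambda xs: [x * 2 for x in xs])` gives a `submit(3)` that returns `6`; when many threads call `submit` concurrently, their inputs are processed in shared batches.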

    Supported Models

    • Text Embeddings: SentenceTransformers and compatible models
    • Reranking Models: Cross-encoder models for result reranking
    • CLIP: Contrastive language-image pretraining models
    • CLAP: Contrastive language-audio pretraining models
    • ColPali: Multi-vector retrieval models

    Installation & Deployment

    Docker (Recommended)

    • GPU: michaelf34/infinity:latest
    • CPU: michaelf34/infinity:latest-cpu

    CLI Installation

    pip install infinity-emb
    infinity_emb v2 --model-id <model> --port <port>
    

    Python API

    Use AsyncEmbeddingEngine for programmatic access with maximum flexibility.
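A minimal sketch of programmatic use follows. The names (`AsyncEngineArray`, `EngineArgs`, `engine.embed`) follow the infinity-emb README, but treat the exact arguments as assumptions to verify against the version you install; the heavy import is kept inside the function so the sketch reads (and imports) without the package present.

```python
import asyncio

def embed_locally(sentences, model="BAAI/bge-small-en-v1.5"):
    """Embed sentences in-process with infinity-emb
    (`pip install infinity-emb[all]`). API names follow the project
    README; confirm them against the installed version."""
    from infinity_emb import AsyncEngineArray, EngineArgs  # heavy import kept local

    array = AsyncEngineArray.from_args([
        EngineArgs(model_name_or_path=model, engine="torch")
    ])

    async def run():
        engine = array[0]
        async with engine:  # starts and stops the engine around the call
            embeddings, usage = await engine.embed(sentences=sentences)
        return embeddings

    return asyncio.run(run())
```

Calling `embed_locally(["Paris is in France."])` should return one embedding vector per input sentence, at the cost of loading the model into the calling process.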

    API Compatibility

    • OpenAI-compatible API specifications
    • Swagger UI available at {url}:{port}/docs for testing
    • RESTful endpoints for easy integration
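Because the endpoints are OpenAI-compatible, any plain HTTP client works. The helper below builds an OpenAI-style request for an `/embeddings` route using only the standard library; the base URL, model name, and exact route are placeholders for your deployment (confirm the route in the Swagger UI at `{url}:{port}/docs`).

```python
import json
import urllib.request

def build_embedding_request(base_url, model, inputs):
    """Build an OpenAI-style POST request for the /embeddings endpoint.
    Returns a urllib.request.Request ready to pass to urlopen()."""
    payload = {"model": model, "input": inputs}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(base_url, model, inputs):
    """Send the request and return the vectors, assuming the OpenAI
    response shape: {"data": [{"embedding": [...]}, ...]}."""
    req = build_embedding_request(base_url, model, inputs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

With a server started as in the Docker example, `embed("http://localhost:7997", "BAAI/bge-small-en-v1.5", ["hello"])` would return a single embedding vector.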

    Use Cases

    • Production embedding services for RAG applications
    • Real-time semantic search systems
    • Multi-modal search with CLIP/CLAP models
    • Reranking services for improved search relevance
    • Document retrieval with ColPali

    Pricing

    Free and open-source, available on GitHub.


    Information

    Website: github.com
    Published: Mar 18, 2026

    Categories

    SDKs & Libraries

    Tags

    #Embeddings #GPU Acceleration #Open Source

    Similar Products

    NVIDIA cuVS

    GPU-accelerated vector search and clustering library from NVIDIA RAPIDS. Provides 8-12x faster index building and queries, with bindings for multiple languages (C, C++, Python, Rust). It is an open-source library.

    Sentence Transformers v3.0

    Major update to the Sentence Transformers library introducing a new SentenceTransformerTrainer for easier fine-tuning, multi-GPU support, improved loss logging, and access to 15,000+ pre-trained models on HuggingFace.

    FlagEmbedding

    Open-source retrieval and RAG framework from BAAI featuring the BGE embedding model series. BGE-M3 supports multi-functionality (dense, sparse, multi-vector), multi-linguality (100+ languages), and multi-granularity (up to 8192 tokens).

    Qwen3 Embedding

    Multilingual embedding model supporting over 100 languages and ranking #1 on MTEB multilingual leaderboard. Offers flexible model sizes from 0.6B to 8B parameters with user-defined instructions.

    BGE-M3

    A versatile multilingual text embedding model from BAAI that supports 100+ languages and can handle inputs up to 8192 tokens. BGE-M3 is unique in supporting three retrieval methods simultaneously: dense retrieval, multi-vector retrieval, and sparse retrieval.

    gte-Qwen2-1.5B-instruct

    A state-of-the-art multilingual text embedding model from Alibaba's GTE (General Text Embedding) series, built on the Qwen2-1.5B LLM. The model supports up to 8192 tokens and incorporates bidirectional attention mechanisms for enhanced contextual understanding across diverse domains.

    Copyright © 2025 Awesome Vector Databases. All rights reserved.